Re: [HACKERS] initdb -S and tablespaces

2015-05-09 Thread Tom Lane
Andres Freund  writes:
> On 2015-05-08 22:08:31 -0400, Robert Haas wrote:
>> Of course, even the last one isn't totally bullet-proof.  Suppose one
>> backend fails to absorb the new setting for some reason...

> I've a hard time worrying much about that one...

You should.  At the very least, whatever recipe we write for changing
fsync safely has to include a clause like "wait for all postmaster
children to have absorbed the new fsync setting".  The facts that (a) this
could be a long time and (b) there's no easy way to be entirely certain
about when it's done don't make it something you should ignore.

I wonder whether we should change fsync to be PGC_POSTMASTER and then
document the safe procedure as requiring a postmaster restart.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-09 Thread Andres Freund
On 2015-05-08 22:08:31 -0400, Robert Haas wrote:
> That seems a bit better.  I think it's really important, if we're
> going to start to try to make fsync=off anything other than a toy,

I think it's long past that. I've seen many, many people use it during
initial data loading.

> that we document really clearly the circumstances in which it is or is
> not safe:

Yea, we really should have done that a long time ago.

> - If you crash while fsync=off, your cluster may be corrupted.

HW crash, right?

> - If you crash while fsync=on, but it was off at the last checkpoint,
> your cluster may be corrupted.
> - If you turn fsync=off, do stuff, turn fsync=on, and checkpoint
> successfully, a subsequent crash should not corrupt anything.

Yep.

> Of course, even the last one isn't totally bullet-proof.  Suppose one
> backend fails to absorb the new setting for some reason...

I've a hard time worrying much about that one...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-08 Thread Robert Haas
On Fri, May 8, 2015 at 7:53 PM, Andres Freund  wrote:
> On 2015-05-04 14:23:16 -0400, Robert Haas wrote:
>> On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen  
>> wrote:
>> > As for the non-backpatchable 0002, I agree with Andres that it should be
>> > included in 9.5; but I take it you're still not convinced?
>>
>> No, I'm not convinced.  That patch will protect you in one extremely
>> specific scenario: you turn off fsync, do some stuff, shut down, turn
>> fsync back on again, and start up.
>
> Hm. ISTM it'd not be hard to actually make it safe in nearly all
> situations. What about tracking the last checkpoint's fsync setting and
> do a fsync_pgdata() in the checkpointer at the end of a checkpointer if
> the previous setting was off and the current one is on?  Combined with
> doing so at startup if the settings changed between runs, that should
> give pretty decent protection. And seems fairly simple to implement.

That seems a bit better.  I think it's really important, if we're
going to start to try to make fsync=off anything other than a toy,
that we document really clearly the circumstances in which it is or is
not safe:

- If you crash while fsync=off, your cluster may be corrupted.
- If you crash while fsync=on, but it was off at the last checkpoint,
your cluster may be corrupted.
- If you turn fsync=off, do stuff, turn fsync=on, and checkpoint
successfully, a subsequent crash should not corrupt anything.

Of course, even the last one isn't totally bullet-proof.  Suppose one
backend fails to absorb the new setting for some reason...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-08 Thread Andres Freund
On 2015-05-04 14:23:16 -0400, Robert Haas wrote:
> On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen  
> wrote:
> > As for the non-backpatchable 0002, I agree with Andres that it should be
> > included in 9.5; but I take it you're still not convinced?
> 
> No, I'm not convinced.  That patch will protect you in one extremely
> specific scenario: you turn off fsync, do some stuff, shut down, turn
> fsync back on again, and start up.

Hm. ISTM it'd not be hard to actually make it safe in nearly all
situations. What about tracking the last checkpoint's fsync setting and
do a fsync_pgdata() in the checkpointer at the end of a checkpointer if
the previous setting was off and the current one is on?  Combined with
doing so at startup if the settings changed between runs, that should
give pretty decent protection. And seems fairly simple to implement.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-05 Thread Robert Haas
On Tue, May 5, 2015 at 6:26 AM, David Rowley  wrote:
> On 5 May 2015 at 06:23, Robert Haas  wrote:
>>
>> OK, committed and back-patched.
>
> There's a couple of problems with this that the windows buildfarm members
> are not happy with.
>
> The attached patch seems to work locally.

Thanks.  I think the open() stuff should be fixed by using
BasicOpenFile() rather than introducing support for the two-argument
form of open().

I'll push a fix shortly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-05 Thread David Rowley
On 5 May 2015 at 06:23, Robert Haas  wrote:

>
> OK, committed and back-patched.
>

There's a couple of problems with this that the windows buildfarm members
are not happy with.

The attached patch seems to work locally.

Regards

David Rowley


fsync_win32_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-04 Thread Robert Haas
On Fri, May 1, 2015 at 10:41 AM, Abhijit Menon-Sen  wrote:
> At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote:
>>
>> If you don't object to this version, I'll commit it.
>
> Looks fine to me, thank you.

OK, committed and back-patched.

> As for the non-backpatchable 0002, I agree with Andres that it should be
> included in 9.5; but I take it you're still not convinced?

No, I'm not convinced.  That patch will protect you in one extremely
specific scenario: you turn off fsync, do some stuff, shut down, turn
fsync back on again, and start up.  But it won't protect you if you
crash while fsync is off, or after you shut down with fsync=off and
before you restart with fsync=on.  And there's no documentation change
here that would help anyone distinguish between the situations in
which they are protected and the situations in which they are not
protected.  Without that, a lot of people are going to get this wrong.

As an alternative, how about fsync=shutdown parameter?  This could be
documented to fsync the data directory at shutdown.  It could document
that there is a risk of corruption if the server crashes, but that the
database is OK if shut down cleanly.  fsync=off could document that
you must run initdb --sync-only after shutting down, else you are
unsafe.

I'm not wedded to any particular solution, but an undocumented hack
that some people will manage to use safely some of the time doesn't
seem good enough to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Abhijit Menon-Sen
At 2015-05-01 09:57:28 -0400, robertmh...@gmail.com wrote:
>
> If you don't object to this version, I'll commit it.

Looks fine to me, thank you.

As for the non-backpatchable 0002, I agree with Andres that it should be
included in 9.5; but I take it you're still not convinced? Should I add
that to the CF separately for discussion, or what?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Andres Freund
Hi,

I agree that splitting the patch into two separate ones is a good one.

On 2015-05-01 09:57:28 -0400, Robert Haas wrote:
> If you don't object to this version, I'll commit it.  I believe this
> part *should* be back-patched, but Tom seemed to disagree, for reasons
> I'm not really clear on.  This is a potential data corrupting bug as
> legitimate as any other, so a back-patch seems right to me.

Agreed. Especially for WAL files this seems to be a pretty clear
correctness issue to me.

I unsurprisingly think the other patch is a good idea too. But it's
clearly *not* something for the back branches.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Robert Haas
On Fri, May 1, 2015 at 8:42 AM, Abhijit Menon-Sen  wrote:
> At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote:
>> It seems to me that, at a minimum, it would be good to split those
>> controversial and definitely not-back-patchable changes into their
>> own patch.
>
> OK, split here (0002*).
>
>> I do mind putting it into xlog.c instead of some place that's actually
>> appropriate.
>
> OK, moved to storage/file/fd.c (0001*).

Here's a revised version of your 0001 patch which I am comfortable
with.  I changed some of the comments, and I moved the fsync_pgdata()
call slightly later, so that we don't do a (possibly long) set of
fsyncs before printing out the first log message that tells the user
what is going on.

If you don't object to this version, I'll commit it.  I believe this
part *should* be back-patched, but Tom seemed to disagree, for reasons
I'm not really clear on.  This is a potential data corrupting bug as
legitimate as any other, so a back-patch seems right to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


fsync-pgdata-rmh.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Abhijit Menon-Sen
At 2015-05-01 08:10:16 -0400, robertmh...@gmail.com wrote:
>
> It seems to me that, at a minimum, it would be good to split those
> controversial and definitely not-back-patchable changes into their
> own patch.

OK, split here (0002*).

> I do mind putting it into xlog.c instead of some place that's actually
> appropriate.

OK, moved to storage/file/fd.c (0001*).

-- Abhijit
>From 088b80eb0825339eca688e4347a4ef547edcadbb Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup after a crash

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c |  49 +
 src/backend/storage/file/fd.c | 112 ++
 src/include/storage/fd.h  |   2 +
 3 files changed, 163 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6cf4415..084174d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -845,6 +845,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void fsync_pgdata(char *datadir);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -5878,6 +5880,25 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync &&
+		(ControlFile->state != DB_SHUTDOWNED &&
+		 ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY))
+	{
+		fsync_pgdata(data_directory);
+	}
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -11123,3 +11144,31 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Issue fsync recursively on PGDATA and all its contents.
+ */
+static void
+fsync_pgdata(char *datadir)
+{
+	if (!enableFsync)
+		return;
+
+	/*
+	 * If possible, hint to the kernel that we're soon going to fsync
+	 * the data directory and its contents.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE) || \
+	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
+	walkdir(datadir, pre_sync_fname);
+#endif
+
+	/*
+	 * Now we do the fsync()s in the same order.
+	 *
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.
+	 */
+	walkdir(datadir, fsync_fname);
+}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f796717..aba12ca 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2439,3 +2439,115 @@ looks_like_temp_rel_name(const char *name)
 		return false;
 	return true;
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd < 0)
+		ereport(FATAL,
+(errmsg("could not open file \"%s\" before fsync",
+		fname)));
+
+	pg_flush_data(fd, 0, 0);
+
+	close(fd);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself)
+ * and following symbolic links.
+ */
+void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (lstat(subpath, &fst) < 0)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not stat file \"%s\": %m", subpath)));
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+		else if (S_ISDIR(fst.

Re: [HACKERS] initdb -S and tablespaces

2015-05-01 Thread Robert Haas
On Thu, Apr 30, 2015 at 11:29 PM, Abhijit Menon-Sen  
wrote:
>> 2. I don't know why it's part of this patch.
>
> In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his
> rationale as follows:
>
> «What I am thinking of is that, currently, if you start the server
> for initial loading with fsync=off, and then restart it, you're open
> to data loss. So when the current config file setting is changed
> from off to on, we should fsync the data directory. Even if there
> was no crash restart.»

That's awfully clever, but I'm not sure I like the idea of trying to
be that clever.  I think if users temporarily disable fsync, they
should be responsible for using initdb -S after if that is needed in
their situation, and this should be documented.

It seems to me that, at a minimum, it would be good to split those
controversial and definitely not-back-patchable changes into their own
patch.

>> Also, it seems awfully unfortunate to me that we're duplicating a
>> whole pile of code into xlog.c here.
>
> I have pointed out and discussed the duplication several times. I did it
> this way only because we were considering backporting the changes, and
> at the time it seemed better to do this and fix the duplication in a
> separate patch.

As I've mentioned a few times, I don't mind duplicating the code if we
have frontend and backend versions that are materially different.  But
I do mind putting it into xlog.c instead of some place that's actually
appropriate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Abhijit Menon-Sen
At 2015-04-30 16:56:17 -0700, t...@sss.pgh.pa.us wrote:
>
> As for the notion that this needs to be back-patched, I would say no.

Not even just the "fsync after crash" part? I could separate that out
from the control file changes and try to eliminate the duplication. I
think that would be worth back-patching, at least.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Abhijit Menon-Sen
At 2015-04-30 15:37:44 -0400, robertmh...@gmail.com wrote:
>
> 1. It doesn't do that.  As soon as we fsync the data directory, we
>reset the flag.  That's not what "ever disabled" means to me.

Could you suggest an acceptable alternative wording? I can't immediately
think of anything better than "disabled since the last restart". That is
conditional on our resetting the flag, which we will do only if fsync is
enabled at startup. So it's true, but not the whole truth.

> 2. I don't know why it's part of this patch.

In 20150115133245.gg5...@awork2.anarazel.de, Andres explained his
rationale as follows:

«What I am thinking of is that, currently, if you start the server
for initial loading with fsync=off, and then restart it, you're open
to data loss. So when the current config file setting is changed
from off to on, we should fsync the data directory. Even if there
was no crash restart.»

That's what I tried to implement.

> Also, it seems awfully unfortunate to me that we're duplicating a
> whole pile of code into xlog.c here.

I have pointed out and discussed the duplication several times. I did it
this way only because we were considering backporting the changes, and
at the time it seemed better to do this and fix the duplication in a
separate patch.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Tom Lane
Robert Haas  writes:
> On Thu, Apr 30, 2015 at 6:44 PM, Alvaro Herrera
>  wrote:
>> Ah, so that's not the duplicate code that I was remembering -- I think
>> it's walkdir() or something like that, which is in initdb IIRC.

> Yeah, walkdir() is there too.  But if we're going to add that to the
> backend, I think it should go in src/backend/storage/file, not
> src/backend/access/transam.

Agreed that .../transam is a pretty horrid choice; but maybe we should
be thinking about putting it in src/common, so there's only one copy?

As for the notion that this needs to be back-patched, I would say no.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Robert Haas
On Thu, Apr 30, 2015 at 6:44 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera
>>  wrote:
>> >> Also, it seems awfully unfortunate to me that we're duplicating a
>> >> whole pile of code into xlog.c here.  Maybe there's no way to avoid
>> >> the code duplication, but pre_sync_fname() seems like it'd more
>> >> naturally go in fd.c than here.  I dunno where walkdir should go, but
>> >> again, not in xlog.c.
>> >
>> > Hm, there's an interest in backpatching this as a bugfix, if I
>> > understand correctly; hence the duplicated code.  We could remove the
>> > duplicity later with a refactoring patch in master only.
>>
>> That seems pretty silly.  If we going to add pre_sync_fname() to every
>> branch, we should add it to the same (correct) file in all of them,
>> not put it in xlog.c in the back-branches and fd.c in master.
>
> Ah, so that's not the duplicate code that I was remembering -- I think
> it's walkdir() or something like that, which is in initdb IIRC.

Yeah, walkdir() is there too.  But if we're going to add that to the
backend, I think it should go in src/backend/storage/file, not
src/backend/access/transam.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera
>  wrote:
> >> Also, it seems awfully unfortunate to me that we're duplicating a
> >> whole pile of code into xlog.c here.  Maybe there's no way to avoid
> >> the code duplication, but pre_sync_fname() seems like it'd more
> >> naturally go in fd.c than here.  I dunno where walkdir should go, but
> >> again, not in xlog.c.
> >
> > Hm, there's an interest in backpatching this as a bugfix, if I
> > understand correctly; hence the duplicated code.  We could remove the
> > duplicity later with a refactoring patch in master only.
> 
> That seems pretty silly.  If we going to add pre_sync_fname() to every
> branch, we should add it to the same (correct) file in all of them,
> not put it in xlog.c in the back-branches and fd.c in master.

Ah, so that's not the duplicate code that I was remembering -- I think
it's walkdir() or something like that, which is in initdb IIRC.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Robert Haas
On Thu, Apr 30, 2015 at 6:18 PM, Alvaro Herrera
 wrote:
>> Also, it seems awfully unfortunate to me that we're duplicating a
>> whole pile of code into xlog.c here.  Maybe there's no way to avoid
>> the code duplication, but pre_sync_fname() seems like it'd more
>> naturally go in fd.c than here.  I dunno where walkdir should go, but
>> again, not in xlog.c.
>
> Hm, there's an interest in backpatching this as a bugfix, if I
> understand correctly; hence the duplicated code.  We could remove the
> duplicity later with a refactoring patch in master only.

That seems pretty silly.  If we going to add pre_sync_fname() to every
branch, we should add it to the same (correct) file in all of them,
not put it in xlog.c in the back-branches and fd.c in master.

> However, now that you mention a pg_control flag, it becomes evident to
> me that a change to pg_control cannot be back-patched ...

Indeed.  But I think we can solve that problem by just ripping that
part out.  Unless I'm missing something, it's not really doing
anything useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Alvaro Herrera
Robert Haas wrote:

> Also, it seems awfully unfortunate to me that we're duplicating a
> whole pile of code into xlog.c here.  Maybe there's no way to avoid
> the code duplication, but pre_sync_fname() seems like it'd more
> naturally go in fd.c than here.  I dunno where walkdir should go, but
> again, not in xlog.c.

Hm, there's an interest in backpatching this as a bugfix, if I
understand correctly; hence the duplicated code.  We could remove the
duplicity later with a refactoring patch in master only.

However, now that you mention a pg_control flag, it becomes evident to
me that a change to pg_control cannot be back-patched ...

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-30 Thread Robert Haas
On Thu, Apr 16, 2015 at 9:24 AM, Abhijit Menon-Sen  wrote:
> Here's a variation of the earlier patch that follows all links in
> PGDATA. Does this look more like what you had in mind?

I'm really confused by the additional control-file field.  It is
documented as indicating whether fsync was ever disabled while the
server was running.  But:

1. It doesn't do that.  As soon as we fsync the data directory, we
reset the flag.  That's not what "ever disabled" means to me.

2. I don't know why it's part of this patch.  Tracking whether fsync
was ever disabled could be useful forensically, but isn't related to
fsync-ing the data directory after a crash, so I dunno why we'd put
that in this patch.  Tracking whether fsync was disabled recently, as
the patch actually does, doesn't seem to be of any obvious value in
preventing corruption either.

Also, it seems awfully unfortunate to me that we're duplicating a
whole pile of code into xlog.c here.  Maybe there's no way to avoid
the code duplication, but pre_sync_fname() seems like it'd more
naturally go in fd.c than here.  I dunno where walkdir should go, but
again, not in xlog.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-16 Thread Abhijit Menon-Sen
Hi.

Here's a variation of the earlier patch that follows all links in
PGDATA. Does this look more like what you had in mind?

-- Abhijit
>From d86888b0d2f5a3a57027d26ce050a3bbb58670d3 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup if needed

It's needed if we need to perform crash recovery or if fsync was
disabled at some point while the server was running earlier (and
we must store that information in the control file).

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c   | 168 
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/catalog/pg_control.h|   8 +-
 3 files changed, 177 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..263d2d0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -835,6 +835,8 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void fsync_pgdata(char *datadir);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -4811,6 +4813,7 @@ BootStrapXLOG(void)
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->unloggedLSN = 1;
+	ControlFile->fsync_disabled = false;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5868,6 +5871,27 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync &&
+		(ControlFile->fsync_disabled ||
+		 (ControlFile->state != DB_SHUTDOWNED &&
+		  ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)))
+	{
+		fsync_pgdata(data_directory);
+		ControlFile->fsync_disabled = false;
+	}
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -8236,6 +8260,8 @@ CreateCheckPoint(int flags)
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
 	ControlFile->minRecoveryPointTLI = 0;
+	if (!enableFsync)
+		ControlFile->fsync_disabled = true;
 
 	/*
 	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
@@ -11107,3 +11133,145 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd < 0)
+		ereport(FATAL,
+(errmsg("could not open file \"%s\" before fsync",
+		fname)));
+
+	pg_flush_data(fd, 0, 0);
+
+	close(fd);
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself)
+ * and following symbolic links.
+ */
+static void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (lstat(subpath, &fst) < 0)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not stat file \"%s\": %m", subpath)));
+
+		if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+		else if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action);
+#ifndef WIN32
+		else if (S_ISLNK(fst.st_mode))
+#else
+		else if (pg_win32_is_junction(subpath))
+#endif
+		{
+#if defined(HAVE_READLINK) || defined(WIN32)
+			char		linkpath[MAXPGPATH];
+			int			len;
+			str

Re: [HACKERS] initdb -S and tablespaces

2015-04-14 Thread Abhijit Menon-Sen
At 2015-04-15 11:40:34 +0530, a...@2ndquadrant.com wrote:
>
> Still, thanks to the example in basebackup.c, I've got most of a patch
> that should follow any symlinks in PGDATA.

I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir()
function that does what we want (but it recurses into symlinks only
inside pg_tblspc).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-14 Thread Abhijit Menon-Sen
At 2015-04-06 10:16:36 -0300, alvhe...@2ndquadrant.com wrote:
>
> Well, we have many things that can be set as symlinks; some you can
> have initdb create the links for you (initdb --xlogdir), others you
> can move manually.

Looking at sendDir() in backend/replication/basebackup.c, I notice that
the only places where it expects/allows symlinks are for pg_xlog and in
pg_tblspc.

Still, thanks to the example in basebackup.c, I've got most of a patch
that should follow any symlinks in PGDATA. I just need to test a little
more before I post it.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-06 Thread Alvaro Herrera
Abhijit Menon-Sen wrote:

Hi,

> At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote:

> > I also noticed that walkdir() thinks it is completely agnostic on
> > what the passed action is; except that the comment at the bottom talks
> > about how fsync on directories is important, which seems out of place.
> 
> Yes. The function behaves as documented, but the comment is clearly too
> specific. I'm not sure where to put it. I could make walkdir() NOT do
> it, and instead do it in the caller with the same comment. Thoughts?

I think it's enough to state in the function comment that the action is
applied to the top element too.  Maybe add the fsync comment on the
callsite.

> > Hmm ... Actually, since surely we must follow symlinks everywhere,
> > why do we have to do this separately for pg_tblspc? Shouldn't that
> > link-following occur automatically when walking PGDATA in the first
> > place?
> 
> I'm not sure about that (and that's why I've not attached an updated
> patch here). The original idea was to follow only those links that we
> expect to be in PGDATA.
> 
> I suppose it would be easier in terms of the code to follow all links,
> but I don't know if it's the right thing. If that's what you think we
> should do, I can post a simplified patch.

Well, we have many things that can be set as symlinks; some you can have
initdb create the links for you (initdb --xlogdir), others you can move
manually.  I think not following all links might lead to impossible-to-
detect bugs such as failing to fsync new pgdata subdirectories we add in
the future, for example.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-06 Thread Abhijit Menon-Sen
Hi Álvaro.

Thanks for taking a look at the patch.

At 2015-04-03 13:32:32 -0300, alvhe...@2ndquadrant.com wrote:
>
> But then, since the name is already telling us that it's all about
> PGDATA, maybe we can remove the "recursively" part of the name?

Sure, that makes sense too. Since you and Andres both like
«fsync_pgdata(data_directory)», I'll change it accordingly.

> I also noticed that walkdir() thinks it is completely agnostic on
> what the passed action is; except that the comment at the bottom talks
> about how fsync on directories is important, which seems out of place.

Yes. The function behaves as documented, but the comment is clearly too
specific. I'm not sure where to put it. I could make walkdir() NOT do
it, and instead do it in the caller with the same comment. Thoughts?

> Hmm ... Actually, since surely we must follow symlinks everywhere,
> why do we have to do this separately for pg_tblspc? Shouldn't that
> link-following occur automatically when walking PGDATA in the first
> place?

I'm not sure about that (and that's why I've not attached an updated
patch here). The original idea was to follow only those links that we
expect to be in PGDATA.

I suppose it would be easier in terms of the code to follow all links,
but I don't know if it's the right thing. If that's what you think we
should do, I can post a simplified patch.

> Maybe fsync_recursively should Assert() that it's only being called
> with enableFsync=on; or perhaps we can have it return early if it's
> unset.

I prefer the latter. Will change.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-03 Thread Alvaro Herrera
Abhijit Menon-Sen wrote:
> At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote:

> Patch attached.
> 
> Changes:
> 
> 1. Renamed perform_fsync to fsync_recursively (otherwise it would read
>"fsync_pgdata(pg_data)")

Okay, but as far as I can tell this function is very specific to
PGDATA; you couldn't use it in any other directory (or pg_tblspc would
be missing and that would cause an error, right?)  Therefore I think it
would make sense to have the name reflect this; maybe
fsync_datadir_recursively(data_directory)
or
fsync_pgdata_recursively(data_directory)
would work?  But then, since the name is already telling us that it's
all about PGDATA, maybe we can remove the "recursively" part of the
name?  Not sure about any of this; other opinions?

I also noticed that walkdir() thinks it is completely agnostic on what
the passed action is; except that the comment at the bottom talks about
how fsync on directories is important, which seems out of place.

I wonder about walktblspc_links() too.  Seems to me that that function
is pretty much the same as walkdir(); would it work to add a flag to the
latter to change the behavior in whatever way needs to be changed, and
remove the former?  Hmm ... Actually, since surely we must follow
symlinks everywhere, why do we have to do this separately for pg_tblspc?
Shouldn't that link-following occur automatically when walking PGDATA in
the first place?


> 2. Added ControlData->fsync_disabled to record whether fsync was ever
>disabled while the server was running (tested in CreateCheckPoint)
> 3. Run fsync_recursively at startup only if fsync is enabled
> 4. Run it if we're doing crash recovery, or fsync was disabled
> 5. Use pg_flush_data in pre_sync_fname
> 6. Issue fsync on directories too
> 7. Tested that it works if pg_xlog is a symlink (no changes).
> 
> (In short, everything you mentioned in your earlier mail.)
> 
> Note that I set ControlData->fsync_disabled to false in BootstrapXLOG,
> but it gets set to true during a later CreateCheckPoint(). This means
> we run fsync again at startup after initdb. I'm not sure what to do
> about that.

This all looks reasonable to me.  I just noticed, though, that
the fd.c routines test enableFsync and do nothing if it's not enabled;
but fsync_recursively goes to all the trouble of doing stuff even if
disabled, and the actions are skipped later; the enableFsync check is
then responsibility of the caller.  This seems a bit prone to later
confusion.  Maybe fsync_recursively should Assert() that it's only being
called with enableFsync=on; or perhaps we can have it return early if
it's unset.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-03-10 Thread Abhijit Menon-Sen
At 2015-01-15 14:32:45 +0100, and...@2ndquadrant.com wrote:
>
> What I am thinking of is that, currently, if you start the server for
> initial loading with fsync=off, and then restart it, you're open to
> data loss. So when the current config file setting is changed from off
> to on, we should fsync the data directory. Even if there was no crash
> restart.

Patch attached.

Changes:

1. Renamed perform_fsync to fsync_recursively (otherwise it would read
   "fsync_pgdata(pg_data)")
2. Added ControlData->fsync_disabled to record whether fsync was ever
   disabled while the server was running (tested in CreateCheckPoint)
3. Run fsync_recursively at startup only if fsync is enabled
4. Run it if we're doing crash recovery, or fsync was disabled
5. Use pg_flush_data in pre_sync_fname
6. Issue fsync on directories too
7. Tested that it works if pg_xlog is a symlink (no changes).

(In short, everything you mentioned in your earlier mail.)

Note that I set ControlData->fsync_disabled to false in BootstrapXLOG,
but it gets set to true during a later CreateCheckPoint(). This means
we run fsync again at startup after initdb. I'm not sure what to do
about that.

Is this about what you had in mind?

-- Abhijit
>From bb2b5f130525dd44a10eab6829b9802b8f6f7eed Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: Recursively fsync PGDATA at startup if needed

It's needed if we need to perform crash recovery or if fsync was
disabled at some point while the server was running earlier (and
we must store that information in the control file).

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c   | 171 
 src/bin/pg_controldata/pg_controldata.c |   2 +
 src/include/catalog/pg_control.h|   8 +-
 3 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71cbe0e..75a6862 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -833,6 +833,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void pre_sync_fname(char *fname, bool isdir);
+static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path,
+			 void (*action) (char *fname, bool isdir));
+static void fsync_recursively(char *pg_data);
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -4721,6 +4727,7 @@ BootStrapXLOG(void)
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->unloggedLSN = 1;
+	ControlFile->fsync_disabled = false;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5787,6 +5794,27 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 *
+	 * We also do this if the control file indicates that fsync was
+	 * disabled at some point while the server was running earlier.
+	 */
+
+	if (enableFsync &&
+		(ControlFile->fsync_disabled ||
+		 (ControlFile->state != DB_SHUTDOWNED &&
+		  ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)))
+	{
+		fsync_recursively(data_directory);
+		ControlFile->fsync_disabled = false;
+	}
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -8137,6 +8165,8 @@ CreateCheckPoint(int flags)
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
 	ControlFile->minRecoveryPointTLI = 0;
+	if (!enableFsync)
+		ControlFile->fsync_disabled = true;
 
 	/*
 	 * Persist unloggedLSN value. It's reset on crash recovery, so this goes
@@ -11008,3 +11038,144 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd 

Re: [HACKERS] initdb -S and tablespaces

2015-01-15 Thread Andres Freund
On 2015-01-15 11:02:43 +0530, Abhijit Menon-Sen wrote:
> At 2015-01-14 11:59:08 +0100, and...@2ndquadrant.com wrote:
> >
> > > + if (ControlFile->state != DB_SHUTDOWNED &&
> > > + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> > > + perform_fsync(data_directory);
> > > +
> > 
> > a) Please think of a slightly more descriptive name than perform_fsync
> 
> OK. (I just copied the name initdb uses, because at the time I was still
> thinking in terms of a later patch moving this to src/common.) What do
> you think of fsync_recursively? fsync_pgdata?

I like fsync_pgdata/datadir or something.

Note that I think you'll have to check/handle pg_xlog being a symlink -
we explicitly support that as a usecase...

> > c) I'm wondering if we should add fsync to the control file and also
> >perform an fsync if the last shutdown was clear, but fsync was
> >disabled.
> 
> Explain? "Add fsync to the control file" means store the value of the
> fsync GUC setting in the control file?

Yes.

> And would the fsync you mention be dependent on the setting, or unconditional?

What I am thinking of is that, currently, if you start the server for
initial loading with fsync=off, and then restart it, you're open to data
loss. So when the current config file setting is changed from off to on,
we should fsync the data directory. Even if there was no crash restart.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-01-14 Thread Abhijit Menon-Sen
At 2015-01-14 11:59:08 +0100, and...@2ndquadrant.com wrote:
>
> > +   if (ControlFile->state != DB_SHUTDOWNED &&
> > +   ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> > +   perform_fsync(data_directory);
> > +
> 
> a) Please think of a slightly more descriptive name than perform_fsync

OK. (I just copied the name initdb uses, because at the time I was still
thinking in terms of a later patch moving this to src/common.) What do
you think of fsync_recursively? fsync_pgdata?

I think fsync_recursively(data_directory) reads well.

> b) I think we should check here whether fsync is enabled.

OK, will do.

> c) I'm wondering if we should add fsync to the control file and also
>perform an fsync if the last shutdown was clear, but fsync was
>disabled.

Explain? "Add fsync to the control file" means store the value of the
fsync GUC setting in the control file? And would the fsync you mention
be dependent on the setting, or unconditional?

> > +static void
> > +pre_sync_fname(char *fname, bool isdir)
> > +{
> 
> this essentially already exists in the backend inparts. C.f.
> pg_flush_data.

OK, I missed that. I'll rework the patch to use it.

> Theoretically you should also invoke fsync on directories.

OK.

> > +* We need to name the parent of PGDATA.  get_parent_directory() isn't
> > +* enough here, because it can result in an empty string.
> > +*/
> > +   snprintf(pdir, MAXPGPATH, "%s/..", pg_data);
> > +   canonicalize_path(pdir);
> 
> Hm. Why is this neded? Just syncing . should work?

Not sure, will investigate.

Thanks.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-01-14 Thread Andres Freund
Hi,

On 2014-11-06 17:56:53 +0530, Abhijit Menon-Sen wrote:
> + /*
> +  * If we need to perform crash recovery, we issue an fsync on the
> +  * data directory and its contents to try to ensure that any data
> +  * written before the crash are flushed to disk. Otherwise a power
> +  * failure in the near future might cause earlier unflushed writes
> +  * to be lost, even though more recent data written to disk from
> +  * here on would be persisted.
> +  */
> +
> + if (ControlFile->state != DB_SHUTDOWNED &&
> + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> + perform_fsync(data_directory);
> +

a) Please think of a slightly more descriptive name than perform_fsync
b) I think we should check here whether fsync is enabled.
c) I'm wondering if we should add fsync to the control file and also
   perform an fsync if the last shutdown was clear, but fsync was
   disabled.
 
>   if (ControlFile->state == DB_SHUTDOWNED)
>   {
>   /* This is the expected case, so don't be chatty in standalone 
> mode */
> @@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping)
>   XLogCtl->WalWriterSleeping = sleeping;
>   SpinLockRelease(&XLogCtl->info_lck);
>  }
> +
> +/*
> + * Hint to the OS that it should get ready to fsync() this file.
> + *
> + * Adapted from pre_sync_fname in initdb.c
> + */
> +static void
> +pre_sync_fname(char *fname, bool isdir)
> +{

this essentially already exists in the backend inparts. C.f. pg_flush_data.

> +/*
> + * walkdir: recursively walk a directory, applying the action to each
> + * regular file and directory (including the named directory itself).
> + *
> + * Adapted from copydir() in copydir.c.
> + */
> +static void
> +walkdir(char *path, void (*action) (char *fname, bool isdir))
> +{
> + DIR*dir;
> + struct dirent *de;
> +
> + dir = AllocateDir(path);
> + while ((de = ReadDir(dir, path)) != NULL)
> + {
> + charsubpath[MAXPGPATH];
> + struct stat fst;
> +
> + CHECK_FOR_INTERRUPTS();
> +
> + if (strcmp(de->d_name, ".") == 0 ||
> + strcmp(de->d_name, "..") == 0)
> + continue;
> +
> + snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
> +
> + if (lstat(subpath, &fst) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> +  errmsg("could not stat file \"%s\": 
> %m", subpath)));
> +
> + if (S_ISDIR(fst.st_mode))
> + walkdir(subpath, action);
> + else if (S_ISREG(fst.st_mode))
> + (*action) (subpath, false);

Theoretically you should also invoke fsync on directories.

> +/*
> + * Issue fsync recursively on PGDATA and all its contents, including the
> + * links under pg_tblspc.
> + *
> + * Adapted from perform_fsync in initdb.c
> + */
> +static void
> +perform_fsync(char *pg_data)
> +{
> + charpdir[MAXPGPATH];
> + charpg_tblspc[MAXPGPATH];
> +
> + /*
> +  * We need to name the parent of PGDATA.  get_parent_directory() isn't
> +  * enough here, because it can result in an empty string.
> +  */
> + snprintf(pdir, MAXPGPATH, "%s/..", pg_data);
> + canonicalize_path(pdir);

Hm. Why is this neded? Just syncing . should work?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-11-14 Thread Andres Freund
On 2014-10-30 14:30:28 +0530, Abhijit Menon-Sen wrote:
> Here's a proposed patch to initdb to make initdb -S fsync everything
> under pg_tblspc. It introduces a new function that calls walkdir on
> every entry under pg_tblspc. This is only one approach: I could have
> also changed walkdir to follow links, but that would have required a
> bunch of #ifdefs for Windows (because it doesn't have symlinks), and I
> guessed a separate function with two calls might be easier to patch into
> back branches. I've tested this patch under various conditions on Linux,
> but it could use some testing on Windows.

I've pushed this. The windows buildfarm animals that run pg_upgrade (and
thus --sync-only) will have to tell us whether there's a problem. I sure
hope there's none...

Thanks for that patch!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-11-06 Thread Abhijit Menon-Sen
At 2014-10-30 14:30:27 +0530, a...@2ndquadrant.com wrote:
>
> Here's a proposed patch to initdb to make initdb -S fsync everything
> under pg_tblspc.

Oops, I meant to include the corresponding patch to xlog.c to do the
same at startup. It's based on the initdb patch, but modified to not
use fprintf/exit_nicely and so on. (Note that this was written in a
single chunk to aid backpatching. There's no attempt made to share
code in this set of patches.)

Now attached.

-- Abhijit
>From ae91da4df7ee60e6f83c98801e979929442f588a Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Thu, 6 Nov 2014 00:45:56 +0530
Subject: If we need to perform crash recovery, fsync PGDATA recursively

This is so that we don't lose older unflushed writes in the event of
a power failure after crash recovery, where more recent writes are
preserved.

See 20140918083148.ga17...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 184 ++
 1 file changed, 184 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ab04380..ef95f64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -830,6 +830,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static void pre_sync_fname(char *fname, bool isdir);
+static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path,
+			 void (*action) (char *fname, bool isdir));
+static void perform_fsync(char *pg_data);
+
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
  * with the body of the record being the data chunk(s) described by
@@ -5981,6 +5987,19 @@ StartupXLOG(void)
 		ereport(FATAL,
 (errmsg("control file contains invalid data")));
 
+	/*
+	 * If we need to perform crash recovery, we issue an fsync on the
+	 * data directory and its contents to try to ensure that any data
+	 * written before the crash are flushed to disk. Otherwise a power
+	 * failure in the near future might cause earlier unflushed writes
+	 * to be lost, even though more recent data written to disk from
+	 * here on would be persisted.
+	 */
+
+	if (ControlFile->state != DB_SHUTDOWNED &&
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		perform_fsync(data_directory);
+
 	if (ControlFile->state == DB_SHUTDOWNED)
 	{
 		/* This is the expected case, so don't be chatty in standalone mode */
@@ -11262,3 +11281,168 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+/*
+ * Hint to the OS that it should get ready to fsync() this file.
+ *
+ * Adapted from pre_sync_fname in initdb.c
+ */
+static void
+pre_sync_fname(char *fname, bool isdir)
+{
+#if defined(HAVE_SYNC_FILE_RANGE) || \
+	(defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED))
+	int			fd;
+
+	fd = open(fname, O_RDONLY | PG_BINARY);
+
+	/*
+	 * Some OSs don't allow us to open directories at all (Windows returns
+	 * EACCES)
+	 */
+	if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
+		return;
+
+	if (fd < 0)
+		ereport(FATAL,
+(errmsg("could not open file \"%s\" before fsync",
+		fname)));
+
+	/*
+	 * Prefer sync_file_range, else use posix_fadvise.  We ignore any error
+	 * here since this operation is only a hint anyway.
+	 */
+#if defined(HAVE_SYNC_FILE_RANGE)
+	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
+#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+#endif
+
+	close(fd);
+#endif
+}
+
+/*
+ * walkdir: recursively walk a directory, applying the action to each
+ * regular file and directory (including the named directory itself).
+ *
+ * Adapted from copydir() in copydir.c.
+ */
+static void
+walkdir(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *de;
+
+	dir = AllocateDir(path);
+	while ((de = ReadDir(dir, path)) != NULL)
+	{
+		char		subpath[MAXPGPATH];
+		struct stat fst;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (strcmp(de->d_name, ".") == 0 ||
+			strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, de->d_name);
+
+		if (lstat(subpath, &fst) < 0)
+			ereport(ERROR,
+	(errcode_for_file_access(),
+	 errmsg("could not stat file \"%s\": %m", subpath)));
+
+		if (S_ISDIR(fst.st_mode))
+			walkdir(subpath, action);
+		else if (S_ISREG(fst.st_mode))
+			(*action) (subpath, false);
+	}
+	FreeDir(dir);
+
+	/*
+	 * It's important to fsync the destination directory itself as individual
+	 * file fsyncs don't guarantee that the directory entry for the file is
+	 * synced.  Recent versions of ext4 have made the window much wider but
+	 * it's been an issue for ext3 and other filesystems in the past.
+	 */
+	(*action) (path, true);
+}
+
+/*
+ * walktblspc_links: call walkdir on each ent

Re: [HACKERS] initdb -S and tablespaces

2014-10-30 Thread Abhijit Menon-Sen
At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote:
>
> On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote:
> > 
> > I just noticed that initdb -S ("Safely write all database files to disk
> > and exit") does (only) the following in perform_fsync:
> > 
> > pre_sync_fname(pdir, true);
> > walkdir(pg_data, pre_sync_fname);
> > 
> > fsync_fname(pdir, true);
> > walkdir(pg_data, fsync_fname);
> > 
> > walkdir() reads the directory and calls itself recursively for S_ISDIR
> > entries, or calls the function for S_ISREG entries… which means it
> > doesn't follow links.
> > 
> > Which means it doesn't fsync the contents of tablespaces.
> 
> Which means at least pg_upgrade is unsafe right
> now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0.

Here's a proposed patch to initdb to make initdb -S fsync everything
under pg_tblspc. It introduces a new function that calls walkdir on
every entry under pg_tblspc. This is only one approach: I could have
also changed walkdir to follow links, but that would have required a
bunch of #ifdefs for Windows (because it doesn't have symlinks), and I
guessed a separate function with two calls might be easier to patch into
back branches. I've tested this patch under various conditions on Linux,
but it could use some testing on Windows.

-- Abhijit
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index dc1f1df..4fdc9eb 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -218,6 +218,7 @@ static char **filter_lines_with_token(char **lines, const char *token);
 static char **readfile(const char *path);
 static void writefile(char *path, char **lines);
 static void walkdir(char *path, void (*action) (char *fname, bool isdir));
+static void walktblspc_links(char *path, void (*action) (char *fname, bool isdir));
 static void pre_sync_fname(char *fname, bool isdir);
 static void fsync_fname(char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
@@ -588,6 +589,53 @@ walkdir(char *path, void (*action) (char *fname, bool isdir))
 }
 
 /*
+ * walktblspc_links: call walkdir on each entry under the given
+ * pg_tblspc directory, or do nothing if pg_tblspc doesn't exist.
+ */
+static void
+walktblspc_links(char *path, void (*action) (char *fname, bool isdir))
+{
+	DIR		   *dir;
+	struct dirent *direntry;
+	char		subpath[MAXPGPATH];
+
+	dir = opendir(path);
+	if (dir == NULL)
+	{
+		if (errno == ENOENT)
+			return;
+		fprintf(stderr, _("%s: could not open directory \"%s\": %s\n"),
+progname, path, strerror(errno));
+		exit_nicely();
+	}
+
+	while (errno = 0, (direntry = readdir(dir)) != NULL)
+	{
+		if (strcmp(direntry->d_name, ".") == 0 ||
+			strcmp(direntry->d_name, "..") == 0)
+			continue;
+
+		snprintf(subpath, MAXPGPATH, "%s/%s", path, direntry->d_name);
+
+		walkdir(subpath, action);
+	}
+
+	if (errno)
+	{
+		fprintf(stderr, _("%s: could not read directory \"%s\": %s\n"),
+progname, path, strerror(errno));
+		exit_nicely();
+	}
+
+	if (closedir(dir))
+	{
+		fprintf(stderr, _("%s: could not close directory \"%s\": %s\n"),
+progname, path, strerror(errno));
+		exit_nicely();
+	}
+}
+
+/*
  * Hint to the OS that it should get ready to fsync() this file.
  */
 static void
@@ -2377,6 +2425,7 @@ static void
 perform_fsync(void)
 {
 	char		pdir[MAXPGPATH];
+	char		pg_tblspc[MAXPGPATH];
 
 	fputs(_("syncing data to disk ... "), stdout);
 	fflush(stdout);
@@ -2398,6 +2447,11 @@ perform_fsync(void)
 	/* then recursively through the directory */
 	walkdir(pg_data, pre_sync_fname);
 
+	/* now do the same thing for everything under pg_tblspc */
+
+	snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);
+	walktblspc_links(pg_tblspc, pre_sync_fname);
+
 	/*
 	 * Now, do the fsync()s in the same order.
 	 */
@@ -2408,6 +2462,8 @@ perform_fsync(void)
 	/* then recursively through the directory */
 	walkdir(pg_data, fsync_fname);
 
+	walktblspc_links(pg_tblspc, fsync_fname);
+
 	check_ok();
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-09-29 Thread Abhijit Menon-Sen
At 2014-09-29 12:59:09 +0200, and...@2ndquadrant.com wrote:
>
> So I'm afraid at least in a first patch it'll need to be a bit of
> duplication. Fixing initdb's code back to 9.3 and the backend all
> the way back to 9.0.

OK, thanks, I'll submit two separate patches then.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-09-29 Thread Andres Freund
On 2014-09-29 15:43:32 +0530, Abhijit Menon-Sen wrote:
> At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote:
> >
> > Note that the perform_fsync() *was* ok for its original purpose in
> > initdb. At the end of initdb there's no relevant tablespaces. But if
> > used *after* pg_upgrade, that's not necessarily the case.
> 
> Right.
> 
> So, since I'm writing a function to fsync everything inside PGDATA
> anyway, it makes sense to call it both from initdb and StartupXLOG.
> It'll do what initdb -S now does, plus follow links in pg_tblspc.
> 
> Any suggestions about where to put such a function? (I was looking at
> backend/utils/init, but I'm not sure that's a good place for this.)

That can't work unfortunately. Both frontend and backend code need to
execute it... I'm not sure it's realistic to handle both cases the
same. The error handling, opening files/directories, and all will be
different. It'll also make backpatching hard :(.

So I'm afraid at least in a first patch it'll need to be a bit of
duplication. Fixing initdb's code back to 9.3 and the backend all the
way back to 9.0.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-09-29 Thread Abhijit Menon-Sen
At 2014-09-29 11:54:10 +0200, and...@2ndquadrant.com wrote:
>
> Note that the perform_fsync() *was* ok for its original purpose in
> initdb. At the end of initdb there's no relevant tablespaces. But if
> used *after* pg_upgrade, that's not necessarily the case.

Right.

So, since I'm writing a function to fsync everything inside PGDATA
anyway, it makes sense to call it both from initdb and StartupXLOG.
It'll do what initdb -S now does, plus follow links in pg_tblspc.

Any suggestions about where to put such a function? (I was looking at
backend/utils/init, but I'm not sure that's a good place for this.)

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2014-09-29 Thread Andres Freund
On 2014-09-29 14:09:01 +0530, Abhijit Menon-Sen wrote:
> Hi.
> 
> I just noticed that initdb -S ("Safely write all database files to disk
> and exit") does (only) the following in perform_fsync:
> 
> pre_sync_fname(pdir, true);
> walkdir(pg_data, pre_sync_fname);
> 
> fsync_fname(pdir, true);
> walkdir(pg_data, fsync_fname);
> 
> walkdir() reads the directory and calls itself recursively for S_ISDIR
> entries, or calls the function for S_ISREG entries… which means it
> doesn't follow links.
> 
> Which means it doesn't fsync the contents of tablespaces.

Which means at least pg_upgrade is unsafe right
now... c.f. 630cd14426dc1daf85163ad417f3a224eb4ac7b0.

Note that the perform_fsync() *was* ok for its original purpose in
initdb. At the end of initdb there's no relevant tablespaces. But if
used *after* pg_upgrade, that's not necessarily the case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] initdb -S and tablespaces

2014-09-29 Thread Abhijit Menon-Sen
Hi.

I just noticed that initdb -S ("Safely write all database files to disk
and exit") does (only) the following in perform_fsync:

pre_sync_fname(pdir, true);
walkdir(pg_data, pre_sync_fname);

fsync_fname(pdir, true);
walkdir(pg_data, fsync_fname);

walkdir() reads the directory and calls itself recursively for S_ISDIR
entries, or calls the function for S_ISREG entries… which means it
doesn't follow links.

Which means it doesn't fsync the contents of tablespaces.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers