Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Greg Stark
On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas robertmh...@gmail.com wrote:
 Greg Stark, have you managed to get your access issues sorted out?  If

Yep, will look at this today.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Greg Stark
On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark gsst...@mit.edu wrote:
 On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas robertmh...@gmail.com wrote:
 Greg Stark, have you managed to get your access issues sorted out?  If

 Yep, will look at this today.

So I think we have a bigger problem than just copydir.c. It seems to
me we should be fsyncing the table space data directories on every
checkpoint. Otherwise any newly created relations or removed relations
could disappear even though the data in them was fsynced. I'm thinking
I should add an _mdfd_opentblspc(reln) call which returns a file
descriptor for the tablespace and have mdsync() use that to sync the
directory whenever it fsyncs a relation. It would be nice to remember
which tablespaces have been fsynced and only fsync them once though,
that would need another hash table just for tablespaces.

We probably also need to fsync the pg_xlog directory every time we
create or rename an xlog segment.

Are there any other places we do directory operations which we need to
be permanent?


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 So I think we have a bigger problem than just copydir.c. It seems to
 me we should be fsyncing the table space data directories on every
 checkpoint.

Is there any evidence that anyone anywhere has ever lost data because
of a lack of directory fsyncs?  I sure don't recall any bug reports
that seem to match that theory.

It seems to me that we're talking about a huge hit in both code
complexity and performance to deal with a problem that doesn't actually
occur in the field; and which furthermore is trivially solved on any
modern filesystem by choosing the right filesystem options.  Why don't
we just document those options, instead?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Andres Freund
On Sunday 14 February 2010 18:11:39 Tom Lane wrote:
 Greg Stark gsst...@mit.edu writes:
  So I think we have a bigger problem than just copydir.c. It seems to
  me we should be fsyncing the table space data directories on every
  checkpoint.
 
 Is there any evidence that anyone anywhere has ever lost data because
 of a lack of directory fsyncs?  I sure don't recall any bug reports
 that seem to match that theory.
I have actually seen the issue during create database at least. In a 
virtualized hw though...
~1GB template database, lots and lots of small tables, the crash occured maybe 
a minute after CREATE DB, filesystem was xfs, kernel 2.6.30.y.
 
 It seems to me that we're talking about a huge hit in both code
 complexity and performance to deal with a problem that doesn't actually
 occur in the field; and which furthermore is trivially solved on any
 modern filesystem by choosing the right filesystem options.  Why don't
 we just document those options, instead?
Which options would that be? I am not aware that there any for any of the 
recent linux filesystems.
Well, except sync that is, but that sure would be more of a performance hit 
than fsyncing the directory...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On Sunday 14 February 2010 18:11:39 Tom Lane wrote:
 It seems to me that we're talking about a huge hit in both code
 complexity and performance to deal with a problem that doesn't actually
 occur in the field; and which furthermore is trivially solved on any
 modern filesystem by choosing the right filesystem options.  Why don't
 we just document those options, instead?

 Which options would that be? I am not aware that there any for any of the 
 recent linux filesystems.

Shouldn't journaling of metadata be sufficient?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync

2010-02-14 Thread Florian Weimer
* Tom Lane:

 Which options would that be? I am not aware that there any for any of the 
 recent linux filesystems.

 Shouldn't journaling of metadata be sufficient?

You also need to enforce ordering between the directory update and the
file update.  The file metadata is flushed with fsync(), but the
directory isn't.  On some systems, all directory operations are
synchronous, but not on Linux.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync

2010-02-14 Thread Mark Mielke

On 02/14/2010 03:24 PM, Florian Weimer wrote:

* Tom Lane:
   

Which options would that be? I am not aware that there any for any of the
recent linux filesystems.
   

Shouldn't journaling of metadata be sufficient?
 

You also need to enforce ordering between the directory update and the
file update.  The file metadata is flushed with fsync(), but the
directory isn't.  On some systems, all directory operations are
synchronous, but not on Linux.
   


   dirsync
  All directory updates within the filesystem should be 
done  syn-
  chronously.   This  affects  the  following system calls: 
creat,

  link, unlink, symlink, mkdir, rmdir, mknod and rename.

The widely reported problems, though, did not tend to be a problem with 
directory changes written too late - but directory changes being written 
too early. That is, the directory change is written to disk, but the 
file content is not. This is likely because of the ordered journal 
mode widely used in ext3/ext4 where metadata changes are journalled, but 
file pages are not journalled. Therefore, it is important for some 
operations, that the file pages are pushed to disk using fsync(file), 
before the metadata changes are journalled.


In theory there is some open hole where directory updates need to be 
synchronized with file updates, as POSIX doesn't enforce this ordering, 
and we can't trust that all file systems implicitly order things 
correctly, but in practice, I don't see this sort of problem happening.


If you are concerned, enable dirsync.

Cheers,
mark

--
Mark Mielkem...@mielke.cc


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync

2010-02-14 Thread Andres Freund
On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
 On 02/14/2010 03:24 PM, Florian Weimer wrote:
  * Tom Lane:
  Which options would that be? I am not aware that there any for any of
  the recent linux filesystems.
  
  Shouldn't journaling of metadata be sufficient?
  
  You also need to enforce ordering between the directory update and the
  file update.  The file metadata is flushed with fsync(), but the
  directory isn't.  On some systems, all directory operations are
  synchronous, but not on Linux.
 
 dirsync
All directory updates within the filesystem should be
 done  syn-
chronously.   This  affects  the  following system calls:
 creat,
link, unlink, symlink, mkdir, rmdir, mknod and rename.
 
 The widely reported problems, though, did not tend to be a problem with
 directory changes written too late - but directory changes being written
 too early. That is, the directory change is written to disk, but the
 file content is not. This is likely because of the ordered journal
 mode widely used in ext3/ext4 where metadata changes are journalled, but
 file pages are not journalled. Therefore, it is important for some
 operations, that the file pages are pushed to disk using fsync(file),
 before the metadata changes are journalled.
Well, but thats not a problem with pg as it fsyncs the file contents.

 In theory there is some open hole where directory updates need to be
 synchronized with file updates, as POSIX doesn't enforce this ordering,
 and we can't trust that all file systems implicitly order things
 correctly, but in practice, I don't see this sort of problem happening.
I can try to reproduce it if you want...

 If you are concerned, enable dirsync.
If the filesystem already behaves that way a fsync on it should be fairly 
cheap. If it doesnt behave that way doing it is correct...

Besides there is no reason to fsync the directory before the checkpoint, so 
dirsync would require a higher cost than doing it correctly.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Robert Haas
On Sun, Feb 14, 2010 at 10:31 AM, Greg Stark gsst...@mit.edu wrote:
 On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark gsst...@mit.edu wrote:
 On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas robertmh...@gmail.com wrote:
 Greg Stark, have you managed to get your access issues sorted out?  If

 Yep, will look at this today.

 So I think we have a bigger problem than just copydir.c. It seems to
 me we should be fsyncing the table space data directories on every
 checkpoint. Otherwise any newly created relations or removed relations
 could disappear even though the data in them was fsynced. I'm thinking
 I should add an _mdfd_opentblspc(reln) call which returns a file
 descriptor for the tablespace and have mdsync() use that to sync the
 directory whenever it fsyncs a relation. It would be nice to remember
 which tablespaces have been fsynced and only fsync them once though,
 that would need another hash table just for tablespaces.

 We probably also need to fsync the pg_xlog directory every time we
 create or rename an xlog segment.

 Are there any other places we do directory operations which we need to
 be permanent?

I agree with Tom that we need to see some actual reproducible test
cases where this is an issue before we go too crazy with it.  In
theory what you're talking about could also happen when extending a
relation, if we extend into a new file; but I think we need to
convince ourselves that it really happens before we make any more
changes.

On a pragmatic note, if this does turn out to be a problem, it's a
bug: and we can and do fix bugs whenever we discover them.  But the
other part of this patch - to speed up createdb - is a feature - and
we are very rapidly running out of time for 9.0 features.  So I'd like
to vote for getting the feature part of this committed (assuming it's
in good shape, of course) and we can continue to investigate the other
issues but without quite as much urgency.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Andres Freund
On Sunday 14 February 2010 21:57:08 Robert Haas wrote:
 On Sun, Feb 14, 2010 at 10:31 AM, Greg Stark gsst...@mit.edu wrote:
  On Sun, Feb 14, 2010 at 2:03 PM, Greg Stark gsst...@mit.edu wrote:
  On Fri, Feb 12, 2010 at 3:49 PM, Robert Haas robertmh...@gmail.com 
wrote:
  Greg Stark, have you managed to get your access issues sorted out?  If
  
  Yep, will look at this today.
  
  So I think we have a bigger problem than just copydir.c. It seems to
  me we should be fsyncing the table space data directories on every
  checkpoint. Otherwise any newly created relations or removed relations
  could disappear even though the data in them was fsynced. I'm thinking
  I should add an _mdfd_opentblspc(reln) call which returns a file
  descriptor for the tablespace and have mdsync() use that to sync the
  directory whenever it fsyncs a relation. It would be nice to remember
  which tablespaces have been fsynced and only fsync them once though,
  that would need another hash table just for tablespaces.
  
  We probably also need to fsync the pg_xlog directory every time we
  create or rename an xlog segment.
  
  Are there any other places we do directory operations which we need to
  be permanent?
 
 I agree with Tom that we need to see some actual reproducible test
 cases where this is an issue before we go too crazy with it.  In
 theory what you're talking about could also happen when extending a
 relation, if we extend into a new file; but I think we need to
 convince ourselves that it really happens before we make any more
 changes.
Ok, will try to reproduce.

 On a pragmatic note, if this does turn out to be a problem, it's a
 bug: and we can and do fix bugs whenever we discover them.  But the
 other part of this patch - to speed up createdb - is a feature - and
 we are very rapidly running out of time for 9.0 features.  So I'd like
 to vote for getting the feature part of this committed (assuming it's
 in good shape, of course) and we can continue to investigate the other
 issues but without quite as much urgency.
Sound sensible.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-14 Thread Greg Stark
On Sun, Feb 14, 2010 at 8:57 PM, Robert Haas robertmh...@gmail.com wrote:
 On a pragmatic note, if this does turn out to be a problem, it's a
 bug: and we can and do fix bugs whenever we discover them.  But the
 other part of this patch - to speed up createdb - is a feature - and
 we are very rapidly running out of time for 9.0 features.  So I'd like
 to vote for getting the feature part of this committed (assuming it's
 in good shape, of course) and we can continue to investigate the other
 issues but without quite as much urgency.

No problem, I already committed the part that overlaps so I can commit
the rest now. I just want to take extra care given how much wine I've
already had tonight...

Incidentally, sorry Andres, I forgot to credit you in the first commit.
-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync

2010-02-14 Thread Mark Mielke

On 02/14/2010 03:49 PM, Andres Freund wrote:

On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
   

The widely reported problems, though, did not tend to be a problem with
directory changes written too late - but directory changes being written
too early. That is, the directory change is written to disk, but the
file content is not. This is likely because of the ordered journal
mode widely used in ext3/ext4 where metadata changes are journalled, but
file pages are not journalled. Therefore, it is important for some
operations, that the file pages are pushed to disk using fsync(file),
before the metadata changes are journalled.
 

Well, but thats not a problem with pg as it fsyncs the file contents.
   


Exactly. Not a problem.


If you are concerned, enable dirsync.
 

If the filesystem already behaves that way a fsync on it should be fairly
cheap. If it doesnt behave that way doing it is correct...
   


Well, I disagree, as the whole point of this thread is that fsync() is 
*not* cheap. :-)



Besides there is no reason to fsync the directory before the checkpoint, so
dirsync would require a higher cost than doing it correctly.
   


Using ordered metadata journaling has approximately the same effect. 
Provided that the data is fsync()'d before the metadata is required, 
either the metadata is recorded in the journal, in which case the data 
is accessible, or the metadata is NOT recorded in the journal, in which 
case, the files will appear missing. The races that theoretically exist 
would be in situations where the data of one file references a separate 
file that does not yet exist.


You said you would try and reproduce - are you going to try and 
reproduce on ext3/ext4 with ordered journalling enabled? I think 
reproducing outside of a case such as CREATE DATABASE would be 
difficult. It would have to be something like:


open(O_CREAT)/write()/fsync()/close() of new data file, where data 
gets written, but directory data is not yet written out to journal
open()/.../write()/fsync()/close() of existing file to point to new 
data file, but directory data is still not yet written out to journal

crash

In this case, dirsync should be effective at closing this hole.

As for cost? Well, most PostgreSQL data is stored within file content, 
not directory metadata. I think dirsync might slow down some 
operations like CREATE DATABASE or rm -fr, but I would not expect it 
to effect day-to-day performance of the database under real load. Many 
operating systems enable the equivalent of dirsync by default. I 
believe Solaris does this, for example, and other than slowing down rm 
-fr, I don't recall any real complaints about the cost of dirsync.


After writing the above, I'm seriously considering adding dirsync to 
my /db mounts that hold PostgreSQL and MySQL data.


Cheers,
mark

--
Mark Mielkem...@mielke.cc


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-12 Thread Robert Haas
On Wed, Feb 10, 2010 at 9:27 PM, Andres Freund and...@anarazel.de wrote:
 On Monday 08 February 2010 05:53:23 Robert Haas wrote:
 On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera

 alvhe...@commandprompt.com wrote:
  Andres Freund escribió:
  I personally think the fsync on the directory should be added to the
  stable branches - other opinions?
  If wanted I can prepare patches for that.
 
  Yeah, it seems there are two patches here -- one is the addition of
  fsync_fname() and the other is the fsync_prepare stuff.

 Andres, you want to take a crack at splitting this up?
 I hope I didnt duplicate Gregs work, but I didnt hear back from him, so...

 Everything 8.1 is hopeless because cp is used there... I didnt see it worth
 to replace that. The patch applies cleanly for 8.1 to 8.4 and survives the
 regression tests

 Given pg's heavy commit model I didnt see a point to split the patch for 9.0
 as well...

I'd probably argue for committing this patch to both HEAD and the
back-branches, and doing a second commit with the remaining stuff for
HEAD only, but I don't care very much.

Greg Stark, have you managed to get your access issues sorted out?  If
you like, I can do the actual commit on this one.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-10 Thread Andres Freund
On Monday 08 February 2010 05:53:23 Robert Haas wrote:
 On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
 
 alvhe...@commandprompt.com wrote:
  Andres Freund escribió:
  I personally think the fsync on the directory should be added to the
  stable branches - other opinions?
  If wanted I can prepare patches for that.
  
  Yeah, it seems there are two patches here -- one is the addition of
  fsync_fname() and the other is the fsync_prepare stuff.
 
 Andres, you want to take a crack at splitting this up?
I hope I didnt duplicate Gregs work, but I didnt hear back from him, so...

Everything 8.1 is hopeless because cp is used there... I didnt see it worth 
to replace that. The patch applies cleanly for 8.1 to 8.4 and survives the 
regression tests

Given pg's heavy commit model I didnt see a point to split the patch for 9.0 
as well...

Andres
diff --git a/src/port/copydir.c b/src/port/copydir.c
index 72fbf36..b057ffa 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
*** copydir(char *fromdir, char *todir, bool
*** 50,55 
--- 50,56 
  {
  	DIR		   *xldir;
  	struct dirent *xlde;
+ 	int dirfd;
  	char		fromfile[MAXPGPATH];
  	char		tofile[MAXPGPATH];
  
*** copydir(char *fromdir, char *todir, bool
*** 91,96 
--- 92,116 
  	}
  
  	FreeDir(xldir);
+ 
+ 	/*
+ 	 * fsync the directory to make sure data has reached the
+ 	 * disk. While needed by most filesystems, the window got bigger
+ 	 * with newer ones like ext4.
+ 	 */
+ 	dirfd = BasicOpenFile(todir,
+ 	  O_RDONLY | PG_BINARY,
+ 	  S_IRUSR | S_IWUSR);
+ 	if(dirfd == -1)
+ 		ereport(ERROR,
+ 		(errcode_for_file_access(),
+ 		 errmsg(could not open directory for fsync \%s\: %m, todir)));
+ 
+ 	if(pg_fsync(dirfd) == -1)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg(could not fsync directory \%s\: %m, todir)));
+ 	close(dirfd);
  }
  
  /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-08 Thread Greg Stark
On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
 Yeah, it seems there are two patches here -- one is the addition of
 fsync_fname() and the other is the fsync_prepare stuff.

Sorry, I'm just catching up on my mail from FOSDEM this past weekend.

I had come to the same conclusion as Greg that I might as well just
commit it with Tom's pg_flush_data() name and we can decide later if
and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
keeping two apis or not.

So I was just going to commit it like that but I discovered last week
that I don't have cvs write access set up yet. I'll commit it as soon
as I generate a new ssh key and Dave installs it, etc. I intentionally
picked a small simple patch that nobody was waiting on because I knew
there was a risk of delays like this and the paperwork. I'm nearly
there.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-08 Thread Andres Freund
On Monday 08 February 2010 19:34:01 Greg Stark wrote:
 On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote:
  On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
  
  Yeah, it seems there are two patches here -- one is the addition of
  fsync_fname() and the other is the fsync_prepare stuff.
 
 Sorry, I'm just catching up on my mail from FOSDEM this past weekend.
 
 I had come to the same conclusion as Greg that I might as well just
 commit it with Tom's pg_flush_data() name and we can decide later if
 and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
 keeping two apis or not.
 
 So I was just going to commit it like that but I discovered last week
 that I don't have cvs write access set up yet. I'll commit it as soon
 as I generate a new ssh key and Dave installs it, etc. I intentionally
 picked a small simple patch that nobody was waiting on because I knew
 there was a risk of delays like this and the paperwork. I'm nearly
 there.
Do you still want me to split the patches into two or do you want to do it 
yourself?
One in multiple versions for the directory fsync and another one for 9.0?

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Greg Smith

Robert Haas wrote:

Well it seems that what we're trying to implement is more like
it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(),
so maybe that would work.

Anyway, is there something that we can agree on and get committed here
for 9.0, or should we postpone this to 9.1?  It seems simple enough
that we ought to be able to get it done, but we're running out of time
and we don't seem to have a clear vision here yet...
  


This is turning into yet another one of those situations where something 
simple and useful is being killed by trying to generalize it way more 
than it needs to be, given its current goals and its lack of external 
interfaces.  There's no catversion bump or API breakage to hinder future 
refactoring if this isn't optimally designed internally from day one.


The feature is valuable and there seems at least one spot where it may 
be resolving the possibility of a subtle OS interaction bug by being 
more thorough in the way that it writes and syncs.  The main contention 
seems to be over naming and completely optional additional abstraction.  
I consider the whole let's make this cover every type of complicated 
sync on every platform goal interesting and worthwhile, but it's 
completely optional for this release.  The stuff being fretted over now 
is ultimately an internal interface that can be refactored at will in 
later releases with no user impact.


If the goal here could be shifted back to finding the minimal level of 
abstraction that doesn't seem completely wrong, then updating the 
function names and comments to match that more closely, this could 
return to committable.  That's all I thought was left to do when I moved 
it to ready for committer, and as far as I've seen this expanded scope 
of discussion has just moved backwards from that point.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 This is turning into yet another one of those situations where something 
 simple and useful is being killed by trying to generalize it way more 
 than it needs to be, given its current goals and its lack of external 
 interfaces.  There's no catversion bump or API breakage to hinder future 
 refactoring if this isn't optimally designed internally from day one.

I agree that it's too late in the cycle for any major redesign of the
patch.  But is it too much to ask to use a less confusing name for the
function?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Robert Haas
On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Smith g...@2ndquadrant.com writes:
 This is turning into yet another one of those situations where something
 simple and useful is being killed by trying to generalize it way more
 than it needs to be, given its current goals and its lack of external
 interfaces.  There's no catversion bump or API breakage to hinder future
 refactoring if this isn't optimally designed internally from day one.

 I agree that it's too late in the cycle for any major redesign of the
 patch.  But is it too much to ask to use a less confusing name for the
 function?

+1.  Let's just rename the thing, add some comments, and call it good.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
 On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Greg Smith g...@2ndquadrant.com writes:
  This is turning into yet another one of those situations where something
  simple and useful is being killed by trying to generalize it way more
  than it needs to be, given its current goals and its lack of external
  interfaces.  There's no catversion bump or API breakage to hinder future
  refactoring if this isn't optimally designed internally from day one.
  
  I agree that it's too late in the cycle for any major redesign of the
  patch.  But is it too much to ask to use a less confusing name for the
  function?
 
 +1.  Let's just rename the thing, add some comments, and call it good.
Will post a updated patch in the next hours unless somebody beats me too it.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Sunday 07 February 2010 19:27:02 Andres Freund wrote:
 On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
  On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
   Greg Smith g...@2ndquadrant.com writes:
   This is turning into yet another one of those situations where
   something simple and useful is being killed by trying to generalize
   it way more than it needs to be, given its current goals and its lack
   of external interfaces.  There's no catversion bump or API breakage
   to hinder future refactoring if this isn't optimally designed
   internally from day one.
   
   I agree that it's too late in the cycle for any major redesign of the
   patch.  But is it too much to ask to use a less confusing name for the
   function?
  
  +1.  Let's just rename the thing, add some comments, and call it good.
 
 Will post a updated patch in the next hours unless somebody beats me too
 it.
Here we go.

I left the name at my suggestion pg_fsync_prepare instead of Tom's 
prepare_for_fsync because it seemed more consistend with the naming in the 
rest of the file. Obviously feel free to adjust.

I personally think the fsync on the directory should be added to the stable 
branches - other opinions?
If wanted I can prepare patches for that.

Andres
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 7ffa2eb..bc5753a 100644
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
*** pg_fdatasync(int fd)
*** 320,325 
--- 320,361 
  }
  
  /*
+  * pg_fsync_prepare --- try to make a later fsync on the same file faster
+  *
+  * A call to this function does not guarantee anything!
+  *
+  * The idea is to tell the kernel to write out its cache so that a
+  * fsync later on has less to write out synchronously. This allows
+  * that write requests get reordered more freely.
+  *
+  * In the current implementation this has the additional effect of
+  * dropping the cache in that region and thus can be used to avoid
+  * cache poisoning. This may or may not be wanted.
+  *
+  * XXX: Ideally this API would use sync_file_range (or similar on
+  * platforms other than linux) and a seperate one for cache
+  * control. We are not there yet.
+  *
+  * Look at the thread below 200912282354.51892.and...@anarazel.de in
+  * pgsql-hackers for a longer discussion.
+  */
+ int
+ pg_fsync_prepare(int fd, off_t offset, off_t amount)
+ {
+ 	if (enableFsync)
+ 	{
+ #if defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED)
+ 		return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
+ #else
+ 		return 0;
+ #endif
+ 	}
+ 	else
+ 		return 0;
+ }
+ 
+ 
+ /*
   * InitFileAccess --- initialize this module during backend startup
   *
   * This is called during either normal or standalone backend start.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 21cb024..b1a4b49 100644
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
*** extern int	pg_fsync_no_writethrough(int 
*** 99,104 
--- 99,106 
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  
+ extern int	prepare_for_fsync(int fd, off_t offset, off_t amount);
+ 
  /* Filename components for OpenTemporaryFile */
  #define PG_TEMP_FILES_DIR pgsql_tmp
  #define PG_TEMP_FILE_PREFIX pgsql_tmp
diff --git a/src/port/copydir.c b/src/port/copydir.c
index 72fbf36..eef3cfb 100644
*** a/src/port/copydir.c
--- b/src/port/copydir.c
***
*** 37,42 
--- 37,43 
  
  
  static void copy_file(char *fromfile, char *tofile);
+ static void fsync_fname(char *fname);
  
  
  /*
*** copydir(char *fromdir, char *todir, bool
*** 64,69 
--- 65,73 
  (errcode_for_file_access(),
   errmsg(could not open directory \%s\: %m, fromdir)));
  
+ 	/*
+ 	 * Copy all the files
+ 	 */
  	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
  	{
  		struct stat fst;
*** copydir(char *fromdir, char *todir, bool
*** 89,96 
--- 93,127 
  		else if (S_ISREG(fst.st_mode))
  			copy_file(fromfile, tofile);
  	}
+ 	FreeDir(xldir);
+ 
+ 	/*
+ 	 * Be paranoid here and fsync all files to ensure we catch problems.
+ 	 */
+ 	xldir = AllocateDir(fromdir);
+ 	if (xldir == NULL)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg(could not open directory \%s\: %m, fromdir)));
+ 
+ 	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
+ 	{
+ 		if (strcmp(xlde-d_name, .) == 0 ||
+ 			strcmp(xlde-d_name, ..) == 0)
+ 			continue;
  
+ 		snprintf(tofile, MAXPGPATH, %s/%s, todir, xlde-d_name);
+ 		fsync_fname(tofile);
+ 	}
  	FreeDir(xldir);
+ 
+ 	/* It's important to fsync the destination directory itself as
+ 	 * individual file fsyncs don't guarantee that the directory entry
+ 	 * for the file is synced. Recent versions of ext4 have made the
+ 	 * window much wider but it's been true for ext3 and other
+ 	 * filesyetems in the past
+ 	 */
+ 	fsync_fname(todir);
  }
  
  /*
*** copy_file(char 

Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Alvaro Herrera
Andres Freund escribió:

 I personally think the fsync on the directory should be added to the stable 
 branches - other opinions?
 If wanted I can prepare patches for that.

Yeah, it seems there are two patches here -- one is the addition of
fsync_fname() and the other is the fsync_prepare stuff.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Robert Haas
On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Andres Freund escribió:
 I personally think the fsync on the directory should be added to the stable
 branches - other opinions?
 If wanted I can prepare patches for that.

 Yeah, it seems there are two patches here -- one is the addition of
 fsync_fname() and the other is the fsync_prepare stuff.

Andres, you want to take a crack at splitting this up?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-07 Thread Andres Freund
On Monday 08 February 2010 05:53:23 Robert Haas wrote:
 On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
 
 alvhe...@commandprompt.com wrote:
  Andres Freund escribió:
  I personally think the fsync on the directory should be added to the
  stable branches - other opinions?
  If wanted I can prepare patches for that.
  
  Yeah, it seems there are two patches here -- one is the addition of
  fsync_fname() and the other is the fsync_prepare stuff.
 
 Andres, you want to take a crack at splitting this up?
Will do. Later today or tomorrow morning.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-06 Thread Andres Freund
On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
 Andres Freund wrote:
  On 02/03/10 14:42, Robert Haas wrote:
  Well, maybe we should start with a discussion of what kernel calls
  you're aware of on different platforms and then we could try to put an
  API around it.
  
  In linux there is sync_file_range. On newer Posixish systems one can
  emulate that with mmap() and msync() (in batches obviously).
  
  No idea about windows.
 The effective_io_concurrency feature had proof of concept test programs
 that worked using AIO, but actually following through on that
 implementation would require a major restructuring of how the database
 interacts with the OS in terms of reads and writes of blocks.  It looks
 to me like doing something similar to sync_file_range on Windows would
 be similarly difficult.
Looking a bit arround it seems one could achieve something approximediately 
similar to pg_prepare_fsync() by using
CreateFileMapping  MapViewOfFile  FlushViewOfFile 

If I understand it correctly that will flush, but not wait. Unfortunately you 
cant event make it wait, so its not possible to implement sync_file_range or 
similar fully.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-06 Thread Robert Haas
On Sat, Feb 6, 2010 at 7:03 AM, Andres Freund and...@anarazel.de wrote:
 On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
 Andres Freund wrote:
  On 02/03/10 14:42, Robert Haas wrote:
  Well, maybe we should start with a discussion of what kernel calls
  you're aware of on different platforms and then we could try to put an
  API around it.
 
  In linux there is sync_file_range. On newer Posixish systems one can
  emulate that with mmap() and msync() (in batches obviously).
 
  No idea about windows.
 The effective_io_concurrency feature had proof of concept test programs
 that worked using AIO, but actually following through on that
 implementation would require a major restructuring of how the database
 interacts with the OS in terms of reads and writes of blocks.  It looks
 to me like doing something similar to sync_file_range on Windows would
 be similarly difficult.
 Looking a bit arround it seems one could achieve something approximediately
 similar to pg_prepare_fsync() by using
 CreateFileMapping  MapViewOfFile  FlushViewOfFile

 If I understand it correctly that will flush, but not wait. Unfortunately you
 cant event make it wait, so its not possible to implement sync_file_range or
 similar fully.

Well it seems that what we're trying to implement is more like
it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(),
so maybe that would work.

Anyway, is there something that we can agree on and get committed here
for 9.0, or should we postpone this to 9.1?  It seems simple enough
that we ought to be able to get it done, but we're running out of time
and we don't seem to have a clear vision here yet...

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-05 Thread Greg Smith

Andres Freund wrote:

On 02/03/10 14:42, Robert Haas wrote:

Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it.
In linux there is sync_file_range. On newer Posixish systems one can 
emulate that with mmap() and msync() (in batches obviously).


No idea about windows.


There's a series of parameters you can pass into CreateFile:  
http://msdn.microsoft.com/en-us/library/aa363858(VS.85).aspx


A lot of these are already mapped inside of src/port/open.c in a pretty 
straightforward way from the POSIX-oriented interface:


O_RDWR,O_WRONLY - GENERIC_WRITE, GENERIC_READ
O_RANDOM - FILE_FLAG_RANDOM_ACCESS
O_SEQUENTIAL - FILE_FLAG_SEQUENTIAL_SCAN
O_SHORT_LIVED - FILE_ATTRIBUTE_TEMPORARY
O_TEMPORARY - FILE_FLAG_DELETE_ON_CLOSE
O_DIRECT - FILE_FLAG_NO_BUFFERING
O_DSYNC - FILE_FLAG_WRITE_THROUGH

You have to read the whole Caching Behavior section to see exactly how 
all of those interact, and even then notes like 
http://support.microsoft.com/kb/99794 are needed to follow the fine 
points of things like FILE_FLAG_NO_BUFFERING vs. FILE_FLAG_WRITE_THROUGH.


So anything that's setting those POSIX open flags better than before is 
getting the benefit of that improvement on Windows, too.  But that's not 
quite the same as the changes using fadvise to provide better targeted 
cache control hints.


I'm getting the impression that doing much better on Windows might fall 
into the same sort of category as Solaris, where the primary interface 
for this sort of thing is to use an AIO implementation instead:  
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx


The effective_io_concurrency feature had proof of concept test programs 
that worked using AIO, but actually following through on that 
implementation would require a major restructuring of how the database 
interacts with the OS in terms of reads and writes of blocks.  It looks 
to me like doing something similar to sync_file_range on Windows would 
be similarly difficult.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Greg Stark
On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you're probably right, but it's not clear what the new name
 should be until we have a comment explaining what the function is
 responsible for.

So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Andres Freund

On 02/03/10 12:53, Greg Stark wrote:

On Tue, Feb 2, 2010 at 7:45 PM, Robert Haasrobertmh...@gmail.com  wrote:

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.


So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.
I vote for two seperate apis - sure, there will be some unfortunate 
overlap for most unixoid platforms but its sure better possibly to allow 
adding more platforms later at a centralized place than having to 
analyze every place where the api is used.


Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Robert Haas
On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you're probably right, but it's not clear what the new name
 should be until we have a comment explaining what the function is
 responsible for.

 So I wrote some comments but wasn't going to repost the patch with the
 unchanged name without explanation... But I think you're right though
 I was looking at it the other way around. I want to have an API for a
 two-stage sync and of course if I do that I'll comment it to explain
 that clearly.

 The gist of the comments was that the function is preparing to fsync
 to initiate the i/o early and allow the later fsync to fast -- but
 also at the same time have the beneficial side-effect of avoiding
 cache poisoning. It's not clear that the two are necessarily linked
 though. Perhaps we need two separate apis, though it'll be hard to
 keep them separate on all platforms.

Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it.  I mean, right now all you've got is
POSIX_FADV_DONTNEED, so given just that I feel like the API could
simply be pg_dontneed() or something.  It's hard to design a general
framework based on one example.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-03 Thread Andres Freund

On 02/03/10 14:42, Robert Haas wrote:

On Wed, Feb 3, 2010 at 6:53 AM, Greg Starkgsst...@mit.edu  wrote:

On Tue, Feb 2, 2010 at 7:45 PM, Robert Haasrobertmh...@gmail.com  wrote:

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.


So I wrote some comments but wasn't going to repost the patch with the
unchanged name without explanation... But I think you're right though
I was looking at it the other way around. I want to have an API for a
two-stage sync and of course if I do that I'll comment it to explain
that clearly.

The gist of the comments was that the function is preparing to fsync
to initiate the i/o early and allow the later fsync to fast -- but
also at the same time have the beneficial side-effect of avoiding
cache poisoning. It's not clear that the two are necessarily linked
though. Perhaps we need two separate apis, though it'll be hard to
keep them separate on all platforms.


Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it.
In linux there is sync_file_range. On newer Posixish systems one can 
emulate that with mmap() and msync() (in batches obviously).


No idea about windows.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
 I took a look at this patch today and I agree with Tom that
 pg_fsync_start() is a very confusing name.  I don't know what the
 right name is, but this doesn't fsync so I don't think it shuld have
 fsync in the name.  Maybe something like pg_advise_abandon() or
 pg_abandon_cache().  The current name is really wishful thinking:
 you're hoping that it will make the kernel start the fsync, but it
 might not.  I think pg_start_data_flush() is similarly optimistic.

 What about: pg_fsync_prepare().

prepare_for_fsync()?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
 On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark gsst...@mit.edu wrote:
  On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  That function *seriously* needs documentation, in particular the fact
  that it's a no-op on machines without the right kernel call.  The name
  you've chosen is very bad for those semantics.  I'd pick something
  else myself.  Maybe pg_start_data_flush or something like that?
  
  I would like to make one token argument in favour of the name I
  picked. If it doesn't convince I'll change it since we can always
  revisit the API down the road.
  
  I envision having two function calls, pg_fsync_start() and
  pg_fsync_finish(). The latter will wait until the data synced in the
  first call is actually synced. The fall-back if there's no
  implementation of this would be for fsync_start() to be a noop (or
  something unreliable like posix_fadvise) and fsync_finish() to just be
  a regular fsync.
  
  I think we can accomplish this with sync_file_range() but I need to
  read up on how it actually works a bit more. In this case it doesn't
  make a difference since when we call fsync_finish() it's going to be
  for the entire file and nothing else will have been writing to these
  files. But for wal writing and checkpointing it might have very
  different performance characteristics.
  
  The big objection to this is that then we don't really have an api for
  FADV_DONT_NEED which is more about cache policy than about syncing to
  disk. So for example a sequential scan might want to indicate that it
  isn't planning on reading the buffers it's churning through but
  doesn't want to force them to be written sooner than otherwise and is
  never going to call fsync_finish().
 
 I took a look at this patch today and I agree with Tom that
 pg_fsync_start() is a very confusing name.  I don't know what the
 right name is, but this doesn't fsync so I don't think it shuld have
 fsync in the name.  Maybe something like pg_advise_abandon() or
 pg_abandon_cache().  The current name is really wishful thinking:
 you're hoping that it will make the kernel start the fsync, but it
 might not.  I think pg_start_data_flush() is similarly optimistic.
What about: pg_fsync_prepare(). That gives the reason why were doing that and 
doesnt promise that it is actually doing an fsync.
I dislike really having cache in the name, because the primary aim is not to 
discard the cache...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Robert Haas
On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andres Freund and...@anarazel.de writes:
 On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
 I took a look at this patch today and I agree with Tom that
 pg_fsync_start() is a very confusing name.  I don't know what the
 right name is, but this doesn't fsync so I don't think it shuld have
 fsync in the name.  Maybe something like pg_advise_abandon() or
 pg_abandon_cache().  The current name is really wishful thinking:
 you're hoping that it will make the kernel start the fsync, but it
 might not.  I think pg_start_data_flush() is similarly optimistic.

 What about: pg_fsync_prepare().

 prepare_for_fsync()?

It still seems mis-descriptive to me.  Couldn't the same routine be
used simply to abandon undirtied data that we no longer care about
caching?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 19:14:40 Robert Haas wrote:
 On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Andres Freund and...@anarazel.de writes:
  On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
  I took a look at this patch today and I agree with Tom that
  pg_fsync_start() is a very confusing name.  I don't know what the
  right name is, but this doesn't fsync so I don't think it shuld have
  fsync in the name.  Maybe something like pg_advise_abandon() or
  pg_abandon_cache().  The current name is really wishful thinking:
  you're hoping that it will make the kernel start the fsync, but it
  might not.  I think pg_start_data_flush() is similarly optimistic.
  
  What about: pg_fsync_prepare().
  
  prepare_for_fsync()?
 
 It still seems mis-descriptive to me.  Couldn't the same routine be
 used simply to abandon undirtied data that we no longer care about
 caching?
For now it could - but it very well might be converted to sync_file_range or 
similar, which would have different sideeffects.

As the potential code duplication is rather small I would prefer to describe 
the prime effect not the sideeffects...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Robert Haas
On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund and...@anarazel.de wrote:
 For now it could - but it very well might be converted to sync_file_range or
 similar, which would have different sideeffects.

 As the potential code duplication is rather small I would prefer to describe
 the prime effect not the sideeffects...

Hmm, in that case, I think the problem is that this function has no
comment explaining its intended charter.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Andres Freund
On Tuesday 02 February 2010 20:06:32 Robert Haas wrote:
 On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund and...@anarazel.de wrote:
  For now it could - but it very well might be converted to sync_file_range
  or similar, which would have different sideeffects.
  
  As the potential code duplication is rather small I would prefer to
  describe the prime effect not the sideeffects...
 
 Hmm, in that case, I think the problem is that this function has no
 comment explaining its intended charter.
I agree there. Greg, do you want to update the patch with some comments or 
shall I?

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Robert Haas
On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That function *seriously* needs documentation, in particular the fact
 that it's a no-op on machines without the right kernel call.  The name
 you've chosen is very bad for those semantics.  I'd pick something
 else myself.  Maybe pg_start_data_flush or something like that?


 I would like to make one token argument in favour of the name I
 picked. If it doesn't convince I'll change it since we can always
 revisit the API down the road.

 I envision having two function calls, pg_fsync_start() and
 pg_fsync_finish(). The latter will wait until the data synced in the
 first call is actually synced. The fall-back if there's no
 implementation of this would be for fsync_start() to be a noop (or
 something unreliable like posix_fadvise) and fsync_finish() to just be
 a regular fsync.

 I think we can accomplish this with sync_file_range() but I need to
 read up on how it actually works a bit more. In this case it doesn't
 make a difference since when we call fsync_finish() it's going to be
 for the entire file and nothing else will have been writing to these
 files. But for wal writing and checkpointing it might have very
 different performance characteristics.

 The big objection to this is that then we don't really have an api for
 FADV_DONT_NEED which is more about cache policy than about syncing to
 disk. So for example a sequential scan might want to indicate that it
 isn't planning on reading the buffers it's churning through but
 doesn't want to force them to be written sooner than otherwise and is
 never going to call fsync_finish().

I took a look at this patch today and I agree with Tom that
pg_fsync_start() is a very confusing name.  I don't know what the
right name is, but this doesn't fsync so I don't think it shuld have
fsync in the name.  Maybe something like pg_advise_abandon() or
pg_abandon_cache().  The current name is really wishful thinking:
you're hoping that it will make the kernel start the fsync, but it
might not.  I think pg_start_data_flush() is similarly optimistic.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Hmm, in that case, I think the problem is that this function has no
 comment explaining its intended charter.

That's certainly a big problem, but a comment won't fix the fact that
the name is misleading.  We need both a comment and a name change.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-02-02 Thread Robert Haas
On Tue, Feb 2, 2010 at 2:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Hmm, in that case, I think the problem is that this function has no
 comment explaining its intended charter.

 That's certainly a big problem, but a comment won't fix the fact that
 the name is misleading.  We need both a comment and a name change.

I think you're probably right, but it's not clear what the new name
should be until we have a comment explaining what the function is
responsible for.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-29 Thread Greg Stark
On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That function *seriously* needs documentation, in particular the fact
 that it's a no-op on machines without the right kernel call.  The name
 you've chosen is very bad for those semantics.  I'd pick something
 else myself.  Maybe pg_start_data_flush or something like that?


I would like to make one token argument in favour of the name I
picked. If it doesn't convince I'll change it since we can always
revisit the API down the road.

I envision having two function calls, pg_fsync_start() and
pg_fsync_finish(). The latter will wait until the data synced in the
first call is actually synced. The fall-back if there's no
implementation of this would be for fsync_start() to be a noop (or
something unreliable like posix_fadvise) and fsync_finish() to just be
a regular fsync.

I think we can accomplish this with sync_file_range() but I need to
read up on how it actually works a bit more. In this case it doesn't
make a difference since when we call fsync_finish() it's going to be
for the entire file and nothing else will have been writing to these
files. But for wal writing and checkpointing it might have very
different performance characteristics.

The big objection to this is that then we don't really have an api for
FADV_DONT_NEED which is more about cache policy than about syncing to
disk. So for example a sequential scan might want to indicate that it
isn't planning on reading the buffers it's churning through but
doesn't want to force them to be written sooner than otherwise and is
never going to call fsync_finish().



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-26 Thread Greg Smith

Greg Stark wrote:

Actually before we get there could someone who demonstrated the
speedup verify that this patch still gets that same speedup?
  


Let's step back a second and get to the bottom of why some people are 
seeing this and others aren't.  The original report here suggested this 
was an ext4 issue.  As I pointed out recently on the performance list, 
the reason for that is likely that the working write-barrier support for 
ext4 means it's passing through the fsync to lying hard drives via a 
proper cache flush, which didn't happen on your typical ext3 install.  
Given that, I'd expect I could see the same issue with ext3 given a 
drive with its write cache turned off, so that the theory I started 
trying to prove before seeing the patch operate.


What I did was create a little test program that created 5 databases and 
then dropped them:


\timing
create database a;
create database b;
create database c;
create database d;
create database e;
drop database a;
drop database b;
drop database c;
drop database d;
drop database e;

(All of the drop times were very close by the way; around 100ms, nothing 
particularly interesting there)


If I have my system's boot drive (attached to the motherboard, not on 
the caching controller) in its regular, lying mode with write cache on, 
the creates take the following times:


Time: 713.982 ms  Time: 659.890 ms  Time: 590.842 ms  Time: 675.506 ms  
Time: 645.521 ms


A second run gives similar results; seems quite repeatable for every 
test I ran so I'll just show one run of each.


If I then turn off the write-cache on the drive:

$ sudo hdparm -W 0 /dev/sdb

And repeat, these times show up instead:

Time: 6781.205 ms  Time: 6805.271 ms  Time: 6947.037 ms  Time: 6938.644 
ms  Time: 7346.838 ms


So there's the problem case reproduced, right on regular old ext3 and 
Ubuntu Jaunty:  around 7 seconds to create a database, not real impressive.


Applying the last patch you attached, with the cache on, I see this:

Time: 396.105 ms  Time: 389.984 ms  Time: 469.800 ms  Time: 386.043 ms  
Time: 441.269 ms


And if I then turn the write cache off, back to slow times, but much better:

Time: 2162.687 ms  Time: 2174.057 ms  Time: 2215.785 ms  Time: 2174.100 
ms  Time: 2190.811 ms


That makes the average times I'm seeing on my server:

HEAD  Cached:  657 ms Uncached:  6964 ms
Patched Cached:  417 ms Uncached:  2183 ms

Modest speedup even with a caching drive, and a huge speedup in the case 
when you have one with slow fsync.  Looks to me that if you address 
Tom's concern about documentation and function naming, comitting this 
patch will certainly deliver as promised on the performance side.  Maybe 
2 seconds is still too long for some people, but it's at least a whole 
lot better.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.co


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Greg Stark
On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark gsst...@mit.edu wrote:
 Looking at this patch for the commitfest I have a few questions.

So I've touched this patch up a bit:

1) moved the posix_fadvise call to a new fd.c function
pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
waiting on it. Currently it's only implemented with
posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
in the future -- it looks like this call might be good enough for our
checkpoints.

2) advised each 64k chunk as we write it which should avoid poisoning
the cache if you do a large create database on an active system.

3) added the promised but afaict missing fsync of the directory -- i
think we should actually backpatch this.

Barring any objections shall I commit it like this?


-- 
greg


-- 
greg
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***
*** 319,324  pg_fdatasync(int fd)
--- 319,340 
  		return 0;
  }
  
+ int
+ pg_fsync_start(int fd, off_t offset, off_t amount)
+ {
+ 	if (enableFsync)
+ 	{
+ #if defined(USE_POSIX_FADVISE)  defined(POSIX_FADV_DONTNEED)
+ 		return posix_fadvise(fd, offset, amount, POSIX_FADV_DONTNEED);
+ #else
+ 		return 0;
+ #endif
+ 	}
+ 	else
+ 		return 0;
+ }
+ 
+ 
  /*
   * InitFileAccess --- initialize this module during backend startup
   *
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***
*** 98,103  extern int	pg_fsync(int fd);
--- 98,104 
  extern int	pg_fsync_no_writethrough(int fd);
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
+ extern int	pg_fsync_start(int fd, off_t offset, off_t amount);
  
  /* Filename components for OpenTemporaryFile */
  #define PG_TEMP_FILES_DIR pgsql_tmp
*** a/src/port/copydir.c
--- b/src/port/copydir.c
***
*** 37,42 
--- 37,43 
  
  
  static void copy_file(char *fromfile, char *tofile);
+ static void fsync_fname(char *fname);
  
  
  /*
***
*** 64,69  copydir(char *fromdir, char *todir, bool recurse)
--- 65,73 
  (errcode_for_file_access(),
   errmsg(could not open directory \%s\: %m, fromdir)));
  
+ 	/*
+ 	 * Copy all the files
+ 	 */
  	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
  	{
  		struct stat fst;
***
*** 89,96  copydir(char *fromdir, char *todir, bool recurse)
--- 93,127 
  		else if (S_ISREG(fst.st_mode))
  			copy_file(fromfile, tofile);
  	}
+ 	FreeDir(xldir);
+ 
+ 	/*
+ 	 * Be paranoid here and fsync all files to ensure we catch problems.
+ 	 */
+ 	xldir = AllocateDir(fromdir);
+ 	if (xldir == NULL)
+ 		ereport(ERROR,
+ (errcode_for_file_access(),
+  errmsg(could not open directory \%s\: %m, fromdir)));
+ 
+ 	while ((xlde = ReadDir(xldir, fromdir)) != NULL)
+ 	{
+ 		if (strcmp(xlde-d_name, .) == 0 ||
+ 			strcmp(xlde-d_name, ..) == 0)
+ 			continue;
  
+ 		snprintf(tofile, MAXPGPATH, %s/%s, todir, xlde-d_name);
+ 		fsync_fname(tofile);
+ 	}
  	FreeDir(xldir);
+ 
+ 	/* It's important to fsync the destination directory itself as
+ 	 * individual file fsyncs don't guarantee that the directory entry
+ 	 * for the file is synced. Recent versions of ext4 have made the
+ 	 * window much wider but it's been true for ext3 and other
+ 	 * filesyetems in the past 
+ 	 */
+ 	fsync_fname(todir);
  }
  
  /*
***
*** 103,108  copy_file(char *fromfile, char *tofile)
--- 134,140 
  	int			srcfd;
  	int			dstfd;
  	int			nbytes;
+ 	off_t		offset;
  
  	/* Use palloc to ensure we get a maxaligned buffer */
  #define COPY_BUF_SIZE (8 * BLCKSZ)
***
*** 128,134  copy_file(char *fromfile, char *tofile)
  	/*
  	 * Do the data copying.
  	 */
! 	for (;;)
  	{
  		nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
  		if (nbytes  0)
--- 160,166 
  	/*
  	 * Do the data copying.
  	 */
! 	for (offset=0; ; offset+=nbytes)
  	{
  		nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
  		if (nbytes  0)
***
*** 147,161  copy_file(char *fromfile, char *tofile)
  	(errcode_for_file_access(),
  	 errmsg(could not write to file \%s\: %m, tofile)));
  		}
- 	}
  
! 	/*
! 	 * Be paranoid here to ensure we catch problems.
! 	 */
! 	if (pg_fsync(dstfd) != 0)
! 		ereport(ERROR,
! (errcode_for_file_access(),
!  errmsg(could not fsync file \%s\: %m, tofile)));
  
  	if (close(dstfd))
  		ereport(ERROR,
--- 179,195 
  	(errcode_for_file_access(),
  	 errmsg(could not write to file \%s\: %m, tofile)));
  		}
  
! 		/*
! 		 * Avoid poisoning the cache with the entire database just
! 		 * like we do for large sequential scans. Actually that's a
! 		 * white lie, the real motivation for this is that it might
! 		 * urge the kernel to flush these buffers when convenient
! 		 * before we get to the later fsyncs. Empirical results seem
! 		 * to show a big speed benefit on recent Linux kernels.
! 		 */
! 		pg_fsync_start(dstfd, offset, nbytes);
! 	}
  
  	if (close(dstfd))
  		ereport(ERROR,

[HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Greg Stark
On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark gsst...@mit.edu wrote:
 Barring any objections shall I commit it like this?

Actually before we get there could someone who demonstrated the
speedup verify that this patch still gets that same speedup?

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
On Tuesday 19 January 2010 15:52:25 Greg Stark wrote:
 On Mon, Jan 18, 2010 at 4:35 PM, Greg Stark gsst...@mit.edu wrote:
  Looking at this patch for the commitfest I have a few questions.
 
 So I've touched this patch up a bit:
 
 1) moved the posix_fadvise call to a new fd.c function
 pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
 waiting on it. Currently it's only implemented with
 posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
 in the future -- it looks like this call might be good enough for our
 checkpoints.
 
 2) advised each 64k chunk as we write it which should avoid poisoning
 the cache if you do a large create database on an active system.
 
 3) added the promised but afaict missing fsync of the directory -- i
 think we should actually backpatch this.
Yes, that was a bit stupid from me - I added the fsync for directories which 
get recursed into (by not checking if its a file) but not for the uppermost 
level.
So all directories should get fsynced right now but the topmost one.

I will review the patch later when I finally will have some time off again... 
~4h.

Thanks!

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 1) moved the posix_fadvise call to a new fd.c function
 pg_fsync_start(fd,offset,nbytes) which initiates an fsync without
 waiting on it. Currently it's only implemented with
 posix_fadvise(DONT_NEED) but I want to look into using sync_file_range
 in the future -- it looks like this call might be good enough for our
 checkpoints.

That function *seriously* needs documentation, in particular the fact
that it's a no-op on machines without the right kernel call.  The name
you've chosen is very bad for those semantics.  I'd pick something
else myself.  Maybe pg_start_data_flush or something like that?

Other than that quibble it seems basically sane.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-19 Thread Andres Freund
On Tuesday 19 January 2010 15:57:14 Greg Stark wrote:
 On Tue, Jan 19, 2010 at 2:52 PM, Greg Stark gsst...@mit.edu wrote:
  Barring any objections shall I commit it like this?
 
 Actually before we get there could someone who demonstrated the
 speedup verify that this patch still gets that same speedup?
At least on the three machines I tested last time the result is still in the 
same ballpark.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2010-01-18 Thread Greg Stark
Looking at this patch for the commitfest I have a few questions.

1) You said you added an fsync of the new directory -- where is that I
don't see it anywhere.

2) Why does the second pass to do the fsyncs read through fromdir to
find all the filenames. I find that odd and counterintuitive. It would
be much more natural to just loop through the files in the new
directory. But I suppose it serves as an added paranoia check that the
files are in fact still there and we're not fsyncing any files we
didn't just copy. I think it should still work, we should have an
exclusive lock on the template database so there really ought to be no
differences between the directory trees.

3) It would be tempting to do the posix_fadvise on each chunk as we
copy it. That way we avoid poisoning the filesystem cache even as far
as a 1G file. This might actually be quite significant if we're built
without the 1G file chunk size. I'm a bit concerned that the code will
be a big more complex having to depend on a good off_t definition
though. Do we only use 1GB files on systems where off_t is capable of
handling 2^32 without gymnastics?

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Greg Stark
On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund and...@anarazel.de wrote:
 fsync everything in that pass.
 Including the directory - which was not done before and actually might be
 necessary in some cases.

Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
but it doesn't look like we're going to convince the ext4 developers
they're crazy any day soon and it would really suck for a database
created from a template to have files in it go missin.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)

2009-12-28 Thread Andres Freund
On Tuesday 29 December 2009 01:27:29 Greg Stark wrote:
 On Mon, Dec 28, 2009 at 10:54 PM, Andres Freund and...@anarazel.de wrote:
  fsync everything in that pass.
  Including the directory - which was not done before and actually might be
  necessary in some cases.
 
 Er. Yes. At least on ext4 this is pretty important. I wish it weren't,
 but it doesn't look like we're going to convince the ext4 developers
 they're crazy any day soon and it would really suck for a database
 created from a template to have files in it go missin.
Actually it was necessary on ext3 as well - the window to hit the problem just 
was much smaller, wasnt it?

Actually that part should possibly get backported.


Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers