Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Jeff Davis
On Mon, 2013-07-01 at 00:52 -0400, Greg Smith wrote:
 On 6/30/13 9:28 PM, Jon Nelson wrote:
  The performance of the latter (new) test sometimes seems to perform
  worse and sometimes seems to perform better (usually worse) than
  either of the other two. In all cases, posix_fallocate performs
  better, but I don't have a sufficiently old kernel to test with.
 
 This updated test program looks reliable now.  The numbers are very 
 tight when I'd expect them to be, and there's nowhere with the huge 
 differences I saw in the earlier test program.
 
 Here's results from a few sets of popular older platforms:

...

 The win for posix_fallocate is there in most cases, but it's pretty hard 
 to see in these older systems.  That could be OK.  As long as the 
 difference is no more than noise, and that is the case, this could be 
 good enough to commit.  If there are significantly better results on the 
 new platforms, the old ones need to just not get worse.

I ran my own little test on my workstation[1] with the attached
programs. One does what we do now, another mimics the glibc emulation
you described earlier, and another uses posix_fallocate(). It does an
allocation phase, an fsync, a single rewrite, and then another fsync.
The program runs this 64 times for 64 different 16MB files.

write1 and write2 are almost exactly even at 25.5s. write3 is about
14.5s, which is a pretty big improvement.

So, my simple conclusion is that glibc emulation should be about the
same as what we're doing now, so there's no reason to avoid it. That
means, if posix_fallocate() is present, we should use it, because it's
either the same (if emulated in glibc) or significantly faster (if
implemented in the kernel).

If that is your conclusion, as well, it looks like this patch is about
ready for commit. What do you think?

Regards,
Jeff Davis

[1] workstation specs (ubuntu 12.10, ext4):
$ uname -a
Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux


#include fcntl.h
#include stdio.h
#include unistd.h

char buf[8192] = {0};

int main()
{
	int i;
	int filenum;
	char filename[64];

	for (filenum = 0; filenum  64; filenum++)
		{
			sprintf(filename, /tmp/afile.%02d, filenum);
			int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

			// allocate all 16MB
			for(i = 0; i  2048; i++)
{
	write(fd, buf, 8192);
}

			fsync(fd);

			lseek(fd, 0, SEEK_SET);

			// now overwrite it
			for(i = 0; i  2048; i++)
{
	write(fd, buf, 8192);
}

			fsync(fd);

			close(fd);
		}

	return 0;
}

#include fcntl.h
#include stdio.h
#include unistd.h

char buf[8192] = {0};

int main()
{
	int i;
	int filenum;
	char filename[64];

	for (filenum = 0; filenum  64; filenum++)
		{
			sprintf(filename, /tmp/afile.%02d, filenum);
			int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

			// allocate all 16MB
			posix_fallocate(fd, 0, 16 * 1024 * 1024);

			fsync(fd);

			lseek(fd, 0, SEEK_SET);

			// now overwrite it
			for(i = 0; i  2048; i++)
{
	write(fd, buf, 8192);
}

			fsync(fd);
			close(fd);
		}

	return 0;
}

#include fcntl.h
#include stdio.h
#include unistd.h

char buf[8192] = {0};

int main()
{
	int i;
	int filenum;
	char filename[64];

	for (filenum = 0; filenum  64; filenum++)
		{
			sprintf(filename, /tmp/afile.%02d, filenum);
			int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

			// allocate all 16MB
			for(i = 0; i  4096; i++)
{
	pwrite(fd, \0, 4, 4096*(i+1) - 1);
}

			fsync(fd);

			lseek(fd, 0, SEEK_SET);

			// now overwrite it
			for(i = 0; i  2048; i++)
{
	write(fd, buf, 8192);
}

			fsync(fd);

			close(fd);
		}

	return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Greg Smith

On 7/5/13 2:50 AM, Jeff Davis wrote:

So, my simple conclusion is that glibc emulation should be about the
same as what we're doing now, so there's no reason to avoid it. That
means, if posix_fallocate() is present, we should use it, because it's
either the same (if emulated in glibc) or significantly faster (if
implemented in the kernel).


That's what I'm seeing everywhere too.  I'm happy that we've spent 
enough time chasing after potential issues without finding anything now. 
 Pull out the GUC that was added for default and this is ready to commit.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Jon Nelson
On Fri, Jul 5, 2013 at 2:23 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/5/13 2:50 AM, Jeff Davis wrote:

 So, my simple conclusion is that glibc emulation should be about the
 same as what we're doing now, so there's no reason to avoid it. That
 means, if posix_fallocate() is present, we should use it, because it's
 either the same (if emulated in glibc) or significantly faster (if
 implemented in the kernel).


 That's what I'm seeing everywhere too.  I'm happy that we've spent enough
 time chasing after potential issues without finding anything now.  Pull out
 the GUC that was added for default and this is ready to commit.

Wonderful. Is removing the GUC something that I should do or should
that be done by somebody that knows more about what they are doing? (I
am happy to give it a go!)

Should the small test program that I made also be included somewhere
in the source tree?

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-05 Thread Jeff Davis
On Fri, 2013-07-05 at 08:21 -0500, Jon Nelson wrote:
 Wonderful. Is removing the GUC something that I should do or should
 that be done by somebody that knows more about what they are doing? (I
 am happy to give it a go!)

I'll take care of that.

 Should the small test program that I made also be included somewhere
 in the source tree?

I don't think that's necessary, the fact that it's in the mailing list
archives is enough.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Jon Nelson
On Sun, Jun 30, 2013 at 11:52 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 6/30/13 9:28 PM, Jon Nelson wrote:

 The performance of the latter (new) test sometimes seems to perform
 worse and sometimes seems to perform better (usually worse) than
 either of the other two. In all cases, posix_fallocate performs
 better, but I don't have a sufficiently old kernel to test with.


 This updated test program looks reliable now.  The numbers are very tight
 when I'd expect them to be, and there's nowhere with the huge differences I
 saw in the earlier test program.

 Here's results from a few sets of popular older platforms:

If you found yourself with a spare moment, could you run these again
with the number of open/close cycles set high (say, 100) and the
number of rewrites set to 0 and also to 1? Most of the time spent is
actually spent overwriting the files so by reducing or eliminating
that aspect it might be easier to get a handle on the actual
performance difference.



--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Jeff Davis
On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote:
 It was originally generated. Since then it's been maintained by hand.

What is the procedure for maintaining it by hand? Why are
HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though commented
out), but not HAVE_POSIX_FADVISE?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Jeff Davis
On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote:
 This makes platform level testing a lot easier, thanks.  Attached is an 
 updated copy of that program with some error checking.  If the files it 
 creates already existed, the code didn't notice, and a series of write 
 errors happened.  If you set the test up right it's not a problem, but 
 it's better if a bad setup is caught.  I wrapped the whole test with a 
 shell script, also attached, which insures the right test sequence and 
 checks.

Thank you.

 That's glibc helpfully converting your call to posix_fallocate into 
 small writes, because the OS doesn't provide a better way in that 
 kernel.  It's not hard to imagine this being slower than what the WAL 
 code is doing right now.  I'm not worried about correctness issues 
 anymore, but my gut paranoia about this not working as expected on older 
 systems was justified.  Everyone who thought I was just whining owes me 
 a cookie.

So your theory is that it may be slower because there are twice as many
syscalls (one per 4K page rather than one per 8K page)? Interesting
observation.

 This is what I plan to benchmark specifically next.

In the interest of keeping this patch moving forward, do you have an
estimate for when this testing will be complete?

   If the 
 posix_fallocate approach is actually slower than what's done now when 
 it's not getting kernel acceleration, which is the case on RHEL5 era 
 kernels, we might need to make the configure time test more complicated. 
   Whether posix_fallocate is defined isn't sensitive enough; on Linux it 
 may be the case that this only is usable when fallocate() is also there.

I'd say that if posix_fallocate is slower than the existing code on
pretty much any platform, we shouldn't commit the patch at all. I would
be quite surprised if that was the case, however.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Fujii Masao
On Tue, Jul 2, 2013 at 1:55 AM, Jeff Davis pg...@j-davis.com wrote:
 On Sun, 2013-06-30 at 18:55 -0400, Greg Smith wrote:
 This makes platform level testing a lot easier, thanks.  Attached is an
 updated copy of that program with some error checking.  If the files it
 creates already existed, the code didn't notice, and a series of write
 errors happened.  If you set the test up right it's not a problem, but
 it's better if a bad setup is caught.  I wrapped the whole test with a
 shell script, also attached, which insures the right test sequence and
 checks.

 Thank you.

 That's glibc helpfully converting your call to posix_fallocate into
 small writes, because the OS doesn't provide a better way in that
 kernel.  It's not hard to imagine this being slower than what the WAL
 code is doing right now.  I'm not worried about correctness issues
 anymore, but my gut paranoia about this not working as expected on older
 systems was justified.  Everyone who thought I was just whining owes me
 a cookie.

 So your theory is that it may be slower because there are twice as many
 syscalls (one per 4K page rather than one per 8K page)? Interesting
 observation.

 This is what I plan to benchmark specifically next.

 In the interest of keeping this patch moving forward, do you have an
 estimate for when this testing will be complete?

   If the
 posix_fallocate approach is actually slower than what's done now when
 it's not getting kernel acceleration, which is the case on RHEL5 era
 kernels, we might need to make the configure time test more complicated.
   Whether posix_fallocate is defined isn't sensitive enough; on Linux it
 may be the case that this only is usable when fallocate() is also there.

 I'd say that if posix_fallocate is slower than the existing code on
 pretty much any platform, we shouldn't commit the patch at all.

Even in that case, if a user can easily know which platform posix_fallocate
should be used in, we can commit the patch with the configurable GUC
parameter.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Jeff Davis
On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote:
 Even in that case, if a user can easily know which platform posix_fallocate
 should be used in, we can commit the patch with the configurable GUC
 parameter.

I disagree here. We're not talking about a huge win; this speedup may
not even be detectable on a running system.

I think Robert summarized the reason for the patch best: I mean, if
posix_fallocate() is faster, then it's just faster, right?. But if we
need a new GUC, and DBAs now have one more thing they need to test about
their platform, then that argument goes out the window.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Robert Haas
On Mon, Jul 1, 2013 at 2:17 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-07-02 at 02:13 +0900, Fujii Masao wrote:
 Even in that case, if a user can easily know which platform posix_fallocate
 should be used in, we can commit the patch with the configurable GUC
 parameter.

 I disagree here. We're not talking about a huge win; this speedup may
 not even be detectable on a running system.

 I think Robert summarized the reason for the patch best: I mean, if
 posix_fallocate() is faster, then it's just faster, right?. But if we
 need a new GUC, and DBAs now have one more thing they need to test about
 their platform, then that argument goes out the window.

Yeah.  If the patch isn't going to be a win on RHEL 5, I'd consider
that a good reason to scrap it for now and revisit it in 3 years.
There are still a LOT of people running RHEL 5, and the win isn't big
enough to engineer a more complex solution.  But ultimately RHEL 5,
like any other old system, will go away.

However, Greg's latest testing results seem to show that there isn't
much to worry about, so we may be fine anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Peter Eisentraut
On 7/1/13 12:34 PM, Jeff Davis wrote:
 On Sun, 2013-06-30 at 16:09 -0400, Andrew Dunstan wrote:
 It was originally generated. Since then it's been maintained by hand.
 
 What is the procedure for maintaining it by hand?

Edit away.

 Why are
 HAVE_POSIX_SIGNALS and HAVE_SYNC_FILE_RANGE in there (though commented
 out), but not HAVE_POSIX_FADVISE?

Because maintaining it by hand is prone to inconsistencies.  I wouldn't
worry about it.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-07-01 Thread Greg Smith

On 7/1/13 3:44 PM, Robert Haas wrote:

Yeah.  If the patch isn't going to be a win on RHEL 5, I'd consider
that a good reason to scrap it for now and revisit it in 3 years.
There are still a LOT of people running RHEL 5, and the win isn't big
enough to engineer a more complex solution.


I'm still testing, expect to have this wrapped up on my side by 
tomorrow.  So much of the runtime here is the file setup/close that 
having a 2:1 difference in number of writes, what happens on the old 
platforms, it is hard to get excited about.


I don't think the complexity to lock out RHEL5 here is that bad even if 
it turns out to be a good idea.  Just add another configure check for 
fallocate, and on Linux if it's not there don't use posix_fallocate 
either.  Maybe 5 lines of macro code?  RHEL5 sure isn't going anyway 
anytime soon, but at the same time there won't be that many 9.4 
deployments on that version.


I've been digging into the main situation where this feature helps, and 
it won't be easy to duplicate in a benchmark situation.  Using Linux's 
fallocate works as a hint that the whole 16MB should be allocated at 
once, and therefore together on disk if feasible.  The resulting WAL 
files should be less prone to fragmentation.  That's actually the 
biggest win of this approach, but I can't easily duplicate the sort of 
real-world fragmentation I see on live servers here.Given that, I'm 
leaning toward saying that unless there's a clear regression on older 
platforms, above the noise floor, this is still the right thing to do.


I fully agree that this needs to fully automatic--no GUC--before it's 
worth committing.  If we can't figure out the right thing to do now, 
there's little hope anyone else will in a later tuning expedition.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jeff Davis
On Tue, 2013-05-28 at 22:10 -0400, Greg Smith wrote:
 I was just thinking of something to run in your test program, not 
 another build time check.  Just run the new allocation sequence, and 
 then check the resulting WAL file for a) correct length, and b) 16K of 
 zero bytes.  I would like to build some confidence that posix_fallocate 
 is operating correctly in this context on at least one platform.  My 
 experience with Linux handling this class of functions correctly has 
 left me skeptical of them working until that's proven to be the case.

As I understand it, you are basically asking if posix_fallocate() works
at all anywhere.

Simple test program attached, which creates two files and fills them:
one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
I just cmp the resulting files (and also ls them, to make sure they
are 16MB).

Passes on my workstation:
$ uname -a
Linux jdavis 3.5.0-34-generic #55-Ubuntu SMP Thu Jun 6 20:18:19 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux

Regards,
Jeff Davis


#include fcntl.h

char buf[8192] = {0};

int main()
{
	int i;
	int fda = open(/tmp/afile, O_CREAT | O_EXCL | O_WRONLY, 0600);
	int fdb = open(/tmp/bfile, O_CREAT | O_EXCL | O_WRONLY, 0600);

	for(i = 0; i  2048; i++)
		{
			write(fda, buf, 8192);
		}

	posix_fallocate(fdb, 0, 16*1024*1024);

	close(fda);
	close(fdb);

	return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jeff Davis
On Fri, 2013-06-28 at 11:38 -0700, Josh Berkus wrote:
 Since Greg seems to be busy, what needs to be done to test this?

As I understand it, he was mainly asking if posix_fallocate works at
all. I tried to address that question with a simple test, which behaves
as I expected it to:

http://www.postgresql.org/message-id/1372615313.19747.13.camel@jdavis

Unless something surprising comes up, or someone thinks and objection
has been missed, I am going to commit this soon.

(Of course, to avoid your wrath, I'll get rid of the GUC which was added
for testing.)

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jeff Davis
On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote:
 Unless something surprising comes up, or someone thinks and objection
 has been missed, I am going to commit this soon.

Quick question to anyone who happens to know:

What is the standard procedure for changes to pg_config.h.win32? I
looked at an old patch of mine that Tom (CC'd) committed:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8

and I see that it modifies pg_config.h.win32. Should I modify it in a
similar way for this fallocate patch?

Right now, pg_config.in.win32 seems a little inconsistent because it has
an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's
a generated file, but I don't have a windows environment or MingW.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Andrew Dunstan


On 06/30/2013 03:50 PM, Jeff Davis wrote:

On Sun, 2013-06-30 at 11:11 -0700, Jeff Davis wrote:

Unless something surprising comes up, or someone thinks and objection
has been missed, I am going to commit this soon.

Quick question to anyone who happens to know:

What is the standard procedure for changes to pg_config.h.win32? I
looked at an old patch of mine that Tom (CC'd) committed:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b966dd6c4228d696b291c1cdcb5ab8c8475fefa8

and I see that it modifies pg_config.h.win32. Should I modify it in a
similar way for this fallocate patch?

Right now, pg_config.in.win32 seems a little inconsistent because it has
an entry for HAVE_POSIX_SIGNALS but not HAVE_POSIX_FADVISE. It says it's
a generated file, but I don't have a windows environment or MingW.




It was originally generated. Since then it's been maintained by hand.

If you need a Windows environment to test on, see the Amazon recipe at 
http://wiki.postgresql.org/wiki/Building_With_MinGW


cheers

andrew









--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 6/30/13 2:01 PM, Jeff Davis wrote:

Simple test program attached, which creates two files and fills them:
one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
I just cmp the resulting files (and also ls them, to make sure they
are 16MB).


This makes platform level testing a lot easier, thanks.  Attached is an 
updated copy of that program with some error checking.  If the files it 
creates already existed, the code didn't notice, and a series of write 
errors happened.  If you set the test up right it's not a problem, but 
it's better if a bad setup is caught.  I wrapped the whole test with a 
shell script, also attached, which insures the right test sequence and 
checks.


Your C test program compiles and passes on RHEL5/6 here, doesn't on OS X 
Darwin.  No surprises there, there's a long list of platforms that don't 
support this call at 
https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html 
and the Mac is on it.  Many other platforms I was worried about don't 
support it too--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so 
that cuts down on testing quite a bit.  If it runs faster on Linux, 
that's the main target here, just like the existing 
effective_io_concurrency fadvise code.


The specific thing I was worried about is that this interface might have 
a stub that doesn't work perfectly in older Linux kernels.  After being 
surprised to find this interface worked on RHEL5 with your test program, 
I dug into this more.  It works there, but it may actually be slower.


posix_fallocate is actually implemented by glibc on Linux.  Been there 
since 2.1.94 according to the Linux man pages.  But Linux itself didn't 
add the feature until kernel 2.6.20:  http://lwn.net/Articles/226436/ 
The biggest thing I was worried about--the call might be there in early 
kernels but with a non-functional implementation--that's not the case. 
Looking at the diff, before that patch there's no fallocate at all.


So what happened in earlier kernels, where there was no kernel level 
fallocate available?  According to 
https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html 
what glibc does is check for kernel fallocate(), and if it's not there 
it writes a bunch of zeros to create the file instead.  What is actually 
happening on a RHEL5 system (with kernel 2.6.18) is that calling 
posix_fallocate does this fallback behavior, where it basically does the 
same thing the existing WAL clearing code does.


I can even prove that's the case.  On RHEL5, if you run strace -o out 
./fallocate the main write loop looks like this:


write(3, 
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 
8192) = 8192


But when you call posix_fallocate, you still get a bunch of writes, but 
4 bytes at a time:


pwrite(4, \0, 1, 16769023)= 1
pwrite(4, \0, 1, 16773119)= 1
pwrite(4, \0, 1, 16777215)= 1

That's glibc helpfully converting your call to posix_fallocate into 
small writes, because the OS doesn't provide a better way in that 
kernel.  It's not hard to imagine this being slower than what the WAL 
code is doing right now.  I'm not worried about correctness issues 
anymore, but my gut paranoia about this not working as expected on older 
systems was justified.  Everyone who thought I was just whining owes me 
a cookie.


This is what I plan to benchmark specifically next.  If the 
posix_fallocate approach is actually slower than what's done now when 
it's not getting kernel acceleration, which is the case on RHEL5 era 
kernels, we might need to make the configure time test more complicated. 
 Whether posix_fallocate is defined isn't sensitive enough; on Linux it 
may be the case that this only is usable when fallocate() is also there.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/bin/sh
rm -f fallocate /tmp/afile /tmp/bfile
gcc fallocate.c -o fallocate
if [ ! -x fallocate ] ; then
  echo Test program did not compile, posix_fallocate may not be supported
  exit
fi

./fallocate
if [ -f /tmp/afile ] ; then
  sizea=`du /tmp/afile | cut -f 1`
  sizeb=`du /tmp/bfile | cut -f 1`
  if [ $sizea -eq $sizeb ] ; then
cmp /tmp/afile /tmp/bfile
if [ $? -ne 0 ] ; then
  echo Test failed, files do not match
else
  echo Test passed
fi
  else
echo Test failed, sizes do not match
  fi
fi
#include fcntl.h
#include stdio.h

char buf[8192] = {0};

int main()
{
	int i;
	int written;
	int fda = open(/tmp/afile, O_CREAT | O_EXCL | O_WRONLY, 0600);
	int fdb = open(/tmp/bfile, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fda  0 || fdb  0)
	{
		printf(Opening files failed\n);
		return(1);
	}
	for(i = 0; i  2048; i++)
		{
			written=write(fda, buf, 8192);
			if (written  8192)
			{
printf(Write to file failed);
return(2);
			}
		}

	posix_fallocate(fdb, 0, 16*1024*1024);

	close(fda);
	close(fdb);

	return 0;
}

-- 

Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jon Nelson
On Sun, Jun 30, 2013 at 5:55 PM, Greg Smith g...@2ndquadrant.com wrote:


 pwrite(4, \0, 1, 16769023)= 1
 pwrite(4, \0, 1, 16773119)= 1
 pwrite(4, \0, 1, 16777215)= 1

 That's glibc helpfully converting your call to posix_fallocate into small
 writes, because the OS doesn't provide a better way in that kernel.  It's
 not hard to imagine this being slower than what the WAL code is doing right
 now.  I'm not worried about correctness issues anymore, but my gut paranoia
 about this not working as expected on older systems was justified.  Everyone
 who thought I was just whining owes me a cookie.

I had noted in the very early part of the thread that glibc emulates
posix_fallocate when the (Linux-specific) 'fallocate' systemcall
fails. In this case, it's writing 4 bytes of zeros and then
essentially seeking forward 4092 (4096-4) bytes. This prevents files
with holes in them because the holes have to be at least 4kiB in size,
if I recall properly. It's *not* writing out 16MiB in 4 byte
increments.

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 5/28/13 10:00 PM, Jon Nelson wrote:


A note: The attached test program uses *fsync* instead of *fdatasync*
after calling fallocate (or writing out 16MB of zeroes), per an
earlier suggestion.


I tried this out on the RHEL5 platform I'm worried about now.  There's 
something weird about the test program there.  If I run it once it shows 
posix_fallocate running much faster:


without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

The problem is that I'm seeing the gap between the two get smaller the 
more iterations I run, which makes me wonder if the test is completely fair:


without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s

without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s

without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

You didn't show any output from the latest program on your system, so 
I'm not sure how it behaved for you here.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Jon Nelson
On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/28/13 10:00 PM, Jon Nelson wrote:

 A note: The attached test program uses *fsync* instead of *fdatasync*
 after calling fallocate (or writing out 16MB of zeroes), per an
 earlier suggestion.


 I tried this out on the RHEL5 platform I'm worried about now.  There's
 something weird about the test program there.  If I run it once it shows
 posix_fallocate running much faster:

 without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
 with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

Assuming the platform chosen is using the glibc approach of pwrite(4
bytes) every 4KiB, then the results ought to be similar, and I'm at a
loss to explain why it's performing better (unless - grasping at
straws - simply the *volume* of data transferred from userspace to the
kernel is at play, in which case posix_fallocate will result in 4096
calls to pwrite but at 4 bytes each versus 2048 calls to write at 8KiB
each.) Ultimately the same amount of data gets written to disk (one
would imagine), but otherwise I can't really think of much.

I have also found several errors test_fallocate.c program I posted,
corrected below.
One of them is: it is missing two pairs of parentheses around two #defines:

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

should be:

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

Otherwise the program will end up writing (131072) 8KiB blocks instead of 2048.

This actually makes the comparison between writing 8KiB blocks and
using posix_fallocate favor the latter more strongly in the results
(also seen below).

 The problem is that I'm seeing the gap between the two get smaller the more
 iterations I run, which makes me wonder if the test is completely fair:

 without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
 with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s


 without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
 with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s

 without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
 with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s

 You didn't show any output from the latest program on your system, so I'm
 not sure how it behaved for you here.

On the the platform I use - openSUSE (12.3, x86_64, kernel 3.9.7
currently) I never see posix_fadvise perform worse. Typically better,
sometimes much better.


To set the number of times the file is overwritten to just 1 (one):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done

I am including a revised version of test_fallocate.c that corrects the
above noted error, one typo (from when I changed fdatasync to fsync)
that did not alter program behavior, corrects a mis-placed
gettimeofday (which does change the results) and includes a new test
that aims (perhaps poorly) to emulate the glibc style of pwrite(4
bytes) for every 4KiB, and tests the resulting file size to make sure
it is 16MiB in size.

The performance of the latter (new) test sometimes seems to perform
worse and sometimes seems to perform better (usually worse) than
either of the other two. In all cases, posix_fallocate performs
better, but I don't have a sufficiently old kernel to test with.

The new results on one machine are below.

With 0 (zero) rewrites (testing *just*
open/some_type_of_allocation/fsync/close):

method: classic. 100 open/close iterations, 0 rewrite in 29.6060s
method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s
method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s

And with the same number of rewrites as open/close cycles:

method: classic. 1 open/close iterations, 1 rewrite in 0.6297s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s

method: classic. 2 open/close iterations, 2 rewrite in 1.6455s
method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s
method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s

method: classic. 5 open/close iterations, 5 rewrite in 7.5916s
method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s
method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s

method: classic. 10 open/close iterations, 10 rewrite in 29.2816s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s



--
Jon
#include stdlib.h
#include stdio.h
#include fcntl.h
#include unistd.h
#include string.h
#include sys/time.h
#include sys/types.h
#include sys/stat.h

#define SIXTEENMB (1024*1024*16)
#define FOURKB (1024*4)
#define EIGHTKB (1024*8)

void writeout(int fd, char *buf)
{
	int i;
	for (i = 0; i  SIXTEENMB / EIGHTKB; ++i) {
		if (write(fd, buf, EIGHTKB) != EIGHTKB) {
			fprintf(stderr, Error in write: %m!\n);
			

Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-30 Thread Greg Smith

On 6/30/13 9:28 PM, Jon Nelson wrote:

The performance of the latter (new) test sometimes seems to perform
worse and sometimes seems to perform better (usually worse) than
either of the other two. In all cases, posix_fallocate performs
better, but I don't have a sufficiently old kernel to test with.


This updated test program looks reliable now.  The numbers are very 
tight when I'd expect them to be, and there's nowhere with the huge 
differences I saw in the earlier test program.


Here's results from a few sets of popular older platforms:

RHEL5, ext3

method: classic. 10 open/close iterations, 10 rewrite in 22.6949s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 23.0113s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 22.4921s

method: classic. 10 open/close iterations, 10 rewrite in 23.2808s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.4736s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.9871s

method: classic. 10 open/close iterations, 10 rewrite in 22.4812s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 22.2393s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 23.6940s

RHEL6, ext4

method: classic. 10 open/close iterations, 10 rewrite in 56.0483s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 61.5092s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 53.8569s

method: classic. 10 open/close iterations, 10 rewrite in 57.0361s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 55.9840s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 64.9437sb


RHEL6, ext3

method: classic. 10 open/close iterations, 10 rewrite in 14.4080s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.1395s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 16.9657s

method: classic. 10 open/close iterations, 10 rewrite in 15.2825s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 16.5315s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 14.8115s

The win for posix_fallocate is there in most cases, but it's pretty hard 
to see in these older systems.  That could be OK.  As long as the 
difference is no more than noise, and that is the case, this could be 
good enough to commit.  If there are significantly better results on the 
new platforms, the old ones need to just not get worse.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-28 Thread Josh Berkus
On 06/20/2013 07:19 PM, Jeff Davis wrote:
 On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote:
 Please find attached v5 of the patch, with the above correction in place.
 The only change from the v4 patch is wrapping the if
 (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
 Thanks!
 
 Thank you.
 
 Greg, are you still working on those tests?

Since Greg seems to be busy, what needs to be done to test this?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-20 Thread Jeff Davis
On Sun, 2013-06-16 at 23:53 -0500, Jon Nelson wrote:
 Please find attached v5 of the patch, with the above correction in place.
 The only change from the v4 patch is wrapping the if
 (wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
 Thanks!

Thank you.

Greg, are you still working on those tests?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-16 Thread Jon Nelson
On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
 Ack.  I've revised the patch to always have the GUC (for now), default
 to false, and if configure can't find posix_fallocate (or the user
 disables it by way of pg_config_manual.h) then it remains a GUC that
 simply can't be changed.

 Why have a GUC here at all? Perhaps this was already discussed, and I
 missed it? Is it just for testing purposes, or did you intend for it to
 be in the final version?

As Greg Smith noted, right now it's only for testing purposes.

 * The other code assumes that no errno means ENOSPC. We should be
 consistent about that assumption, and do the same thing in the fallocate
 case.

Unlike write(2), posix_fallocate does *not* set errno, but instead
returns a subset of the errno values as it's return code. The current
approach was suggested to me by Andres Freund and Alvaro Herrera.

 * You check for the presence of posix_fallocate at configure time, but
 don't #ifdef the call site. It looks like you removed this from the v2
 patch, was there a reason for that? Won't that cause build errors for
 platforms without it?

Indeed! I believe I have corrected the error and will be sending out a
revised patch (v5), once the performance and back-testing have
completed.

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-16 Thread Jon Nelson
On Sun, Jun 16, 2013 at 9:59 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Fri, Jun 14, 2013 at 12:06 PM, Jeff Davis pg...@j-davis.com wrote:
 On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
..
 * You check for the presence of posix_fallocate at configure time, but
 don't #ifdef the call site. It looks like you removed this from the v2
 patch, was there a reason for that? Won't that cause build errors for
 platforms without it?

 Indeed! I believe I have corrected the error and will be sending out a
 revised patch (v5), once the performance and back-testing have
 completed.

Please find attached v5 of the patch, with the above correction in place.
The only change from the v4 patch is wrapping the if
(wal_use_fallocate) block in an #ifdef USE_POSIX_FALLOCATE.
Thanks!

--
Jon


fallocate-v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Sat, 2013-05-25 at 13:55 -0500, Jon Nelson wrote:
 Ack.  I've revised the patch to always have the GUC (for now), default
 to false, and if configure can't find posix_fallocate (or the user
 disables it by way of pg_config_manual.h) then it remains a GUC that
 simply can't be changed.

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?

If it's supported, it seems like we always want it. I doubt there are
cases where it hurts performance; but if there are, it's pretty hard for
a DBA to know what those cases are, anyway.

Also:

* The other code assumes that no errno means ENOSPC. We should be
consistent about that assumption, and do the same thing in the fallocate
case.

* You check for the presence of posix_fallocate at configure time, but
don't #ifdef the call site. It looks like you removed this from the v2
patch, was there a reason for that? Won't that cause build errors for
platforms without it?

I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Greg Smith

On 6/14/13 1:06 PM, Jeff Davis wrote:

Why have a GUC here at all? Perhaps this was already discussed, and I
missed it? Is it just for testing purposes, or did you intend for it to
be in the final version?


You have guessed correctly!  I suggested it stay in there only to make 
review benchmarking easier.



I started looking at this patch and it looks like we are getting a
consensus that it's the right approach. Microbenchmarks appear to show a
benefit, and (thanks to Noah's comment) it seems like the change is
safe. Are there any remaining questions or objections?


I'm planning to duplicate Jon's test program on a few machines here, and 
then see if that turns into a useful latency improvement for clients. 
I'm trying to get this pgbench rate limit stuff working first though, 
because one of the tests I had in mind for WAL creation overhead would 
benefit from it.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Tue, 2013-06-11 at 12:58 -0400, Stephen Frost wrote:
 My main question is really- would this be useful for extending
 *relations*?  Apologies if it's already been discussed; I do plan to go
 back and read the threads about this more fully, but I wanted to voice
 my support for using posix_fallocate, when available, in general.

+1, though separate from this patch.

Andres also pointed out that we can try to track a point in the file
that is below any place where a zero page might still exist. That will
allow us to call zero pages invalid unless they are related to a recent
extension, which is a weakness in the current checksums code.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-14 Thread Jeff Davis
On Fri, 2013-06-14 at 13:21 -0400, Greg Smith wrote:
 I'm planning to duplicate Jon's test program on a few machines here, and 
 then see if that turns into a useful latency improvement for clients. 
 I'm trying to get this pgbench rate limit stuff working first though, 
 because one of the tests I had in mind for WAL creation overhead would 
 benefit from it.

Great, I'll wait on those results.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-12 Thread Robert Haas
On Tue, Jun 11, 2013 at 12:58 PM, Stephen Frost sfr...@snowman.net wrote:
 * Merlin Moncure (mmonc...@gmail.com) wrote:
 It's understood that posix_fallocate is faster at this -- the question
 on the table is 'does this matter in context of postgres?'.
 Personally I think this patch should go in regardless -- the concerns
 made IMNSHO are specious.

 I've not had a chance to look at this patch, but I tend to agree with
 Merlin.

I also think this is a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Jon Nelson
There hasn't been much activity here recently.  I'm curious, then, if
there are questions that I can answer.
It may be useful to summarize some things here:

- the purpose of the patch is to use posix_fallocate when creating new
WAL files, because it's (usually) much quicker
- using posix_fallocate is *one* system call versus 2048 calls to write(2)
- additionally, using posix_fallocate /guarantees/ that the filesystem
has space for the WAL file (by spec)
- reportedly (difficult to test or prove), using posix_fallocate *may*
reduce file fragmentation
- the (limited) testing I've done bears this out: the more new WAL
file creation there is, the more the improvement. Once the number of
WAL files reaches a constant point, there does not appear to be either
a positive or a negative performance impact. This is as expected.
- a test program (C) was also written and used which creates,
allocates, and then writes to files as fast as possible. This test
program also shows the expected performance benefits.
- the performance benefits range from a few percent up to about 15 percent

Concerns:
- some were concerned that the spec makes no claims about
posix_fallocate being able to guarantee that the space allocated has
zeroes in it. This was discussed here and on the Linux Kernel mailing
list, wherein the expected behavior is that it does provide zeroes
- most systems don't allocate a great many new WAL files, so the
performance benefit is small
- your concern here

Benefits:
- new WAL file allocate is much quicker, more efficient (fewer system calls)
- the patch is (reportedly - I'm not a good judge here!) quite small


--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Stefan Drees

On 2013-11.06 17:28, Jon Nelson wrote:

There hasn't been much activity here recently.  I'm curious, then, if
there are questions that I can answer.
It may be useful to summarize some things here:

- the purpose of the patch is to use posix_fallocate when creating new
WAL files, because it's (usually) much quicker
- using posix_fallocate is *one* system call versus 2048 calls to write(2)
- additionally, using posix_fallocate /guarantees/ that the filesystem
has space for the WAL file (by spec)
- reportedly (difficult to test or prove), using posix_fallocate *may*
reduce file fragmentation
- the (limited) testing I've done bears this out: the more new WAL
file creation there is, the more the improvement. Once the number of
WAL files reaches a constant point, there does not appear to be either
a positive or a negative performance impact. This is as expected.
- a test program (C) was also written and used which creates,
allocates, and then writes to files as fast as possible. This test
program also shows the expected performance benefits.
- the performance benefits range from a few percent up to about 15 percent


tried the test program of the patch at least a bit.

Retrieved it from:
http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c

on an oldish linux box (Kernel 2.6.32, x86_64) following
$ gcc -o test_fallocate test_fallocate.c
a short
$ ./test_fallocate foo 1 1
yields:
without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s

on another box (Kernel 3.2.0, x86_64) same procedure yields:
without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s
with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s

Note, when trying gcc -O2 test_fallocate.c fails to compile with:

In file included from /usr/include/fcntl.h:252:0,
 from test_fallocate.c:3:
In function ‘open’,
inlined from ‘main’ at test_fallocate.c:68:16:
/usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to 
‘__open_missing_mode’ declared with attribute error: open with O_CREAT 
in second argument needs 3 arguments




Concerns:
- some were concerned that the spec makes no claims about
posix_fallocate being able to guarantee that the space allocated has
zeroes in it. This was discussed here and on the Linux Kernel mailing
list, wherein the expected behavior is that it does provide zeroes
- most systems don't allocate a great many new WAL files, so the
performance benefit is small
- your concern here

Benefits:
- new WAL file allocate is much quicker, more efficient (fewer system calls)
- the patch is (reportedly - I'm not a good judge here!) quite small


HTH,
Stefan.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Merlin Moncure
On Tue, Jun 11, 2013 at 11:08 AM, Stefan Drees ste...@drees.name wrote:
 On 2013-11.06 17:28, Jon Nelson wrote:

 There hasn't been much activity here recently.  I'm curious, then, if
 there are questions that I can answer.
 It may be useful to summarize some things here:

 - the purpose of the patch is to use posix_fallocate when creating new
 WAL files, because it's (usually) much quicker
 - using posix_fallocate is *one* system call versus 2048 calls to write(2)
 - additionally, using posix_fallocate /guarantees/ that the filesystem
 has space for the WAL file (by spec)
 - reportedly (difficult to test or prove), using posix_fallocate *may*
 reduce file fragmentation
 - the (limited) testing I've done bears this out: the more new WAL
 file creation there is, the more the improvement. Once the number of
 WAL files reaches a constant point, there does not appear to be either
 a positive or a negative performance impact. This is as expected.
 - a test program (C) was also written and used which creates,
 allocates, and then writes to files as fast as possible. This test
 program also shows the expected performance benefits.
 - the performance benefits range from a few percent up to about 15 percent


 tried the test program of the patch at least a bit.

 Retrieved it from:
 http://www.postgresql.org/message-id/attachment/29088/test_fallocate.c

 on an oldish linux box (Kernel 2.6.32, x86_64) following
 $ gcc -o test_fallocate test_fallocate.c
 a short
 $ ./test_fallocate foo 1 1
 yields:
 without posix_fallocate: 1 open/close iterations, 1 rewrite in 26.1993s
 with posix_fallocate: 1 open/close iterations, 1 rewrite in 13.3299s

 on another box (Kernel 3.2.0, x86_64) same procedure yields:
 without posix_fallocate: 1 open/close iterations, 1 rewrite in 19.1972s
 with posix_fallocate: 1 open/close iterations, 1 rewrite in 9.9280s

 Note, when trying gcc -O2 test_fallocate.c fails to compile with:

 In file included from /usr/include/fcntl.h:252:0,
  from test_fallocate.c:3:
 In function ‘open’,
 inlined from ‘main’ at test_fallocate.c:68:16:
 /usr/include/x86_64-linux-gnu/bits/fcntl2.h:51:24: error: call to
 ‘__open_missing_mode’ declared with attribute error: open with O_CREAT in
 second argument needs 3 arguments

It's understood that posix_fallocate is faster at this -- the question
on the table is 'does this matter in context of postgres?'.
Personally I think this patch should go in regardless -- the concerns
made IMNSHO are specious.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Stephen Frost
* Merlin Moncure (mmonc...@gmail.com) wrote:
 It's understood that posix_fallocate is faster at this -- the question
 on the table is 'does this matter in context of postgres?'.
 Personally I think this patch should go in regardless -- the concerns
 made IMNSHO are specious.

I've not had a chance to look at this patch, but I tend to agree with
Merlin.  My main question is really- would this be useful for extending
*relations*?  Apologies if it's already been discussed; I do plan to go
back and read the threads about this more fully, but I wanted to voice
my support for using posix_fallocate, when available, in general.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Greg Smith

On 6/11/13 12:22 PM, Merlin Moncure wrote:


Personally I think this patch should go in regardless -- the concerns
made IMNSHO are specious.


That's nice, but we have this process for validating whether features go 
in or not that relies on review instead of opinions.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Stefan Drees

On 2013-06-11 19:45 CEST, Greg Smith wrote:

On 6/11/13 12:22 PM, Merlin Moncure wrote:


Personally I think this patch should go in regardless -- the concerns
made IMNSHO are specious.


That's nice, but we have this process for validating whether features go
in or not that relies on review instead of opinions.

;-) that's why I played with the test_fallocate.c, as it was easy to do 
and I understood, the author (of the patch) wanted to trigger some 
reviews ... I do not (yet) know anything about the core codes, so I 
leave this to the hackers. My review result was, that with newer gcc's 
you should specify an open mode flag as third argument of the fopen 
call, as only with the test tool nothing important found.


Stefan.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Jon Nelson
On Tue, Jun 11, 2013 at 12:49 PM, Stefan Drees ste...@drees.name wrote:
 On 2013-06-11 19:45 CEST, Greg Smith wrote:

 On 6/11/13 12:22 PM, Merlin Moncure wrote:

 Personally I think this patch should go in regardless -- the concerns
 made IMNSHO are specious.


 That's nice, but we have this process for validating whether features go
 in or not that relies on review instead of opinions.

 ;-) that's why I played with the test_fallocate.c, as it was easy to do and
 I understood, the author (of the patch) wanted to trigger some reviews ... I
 do not (yet) know anything about the core codes, so I leave this to the
 hackers. My review result was, that with newer gcc's you should specify an
 open mode flag as third argument of the fopen call, as only with the test
 tool nothing important found.

If you change line 68 to read:

fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);

then it should work just fine (assuming you have not already done so).


--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-06-11 Thread Merlin Moncure
On Tue, Jun 11, 2013 at 12:45 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 6/11/13 12:22 PM, Merlin Moncure wrote:

 Personally I think this patch should go in regardless -- the concerns
 made IMNSHO are specious.

 That's nice, but we have this process for validating whether features go in
 or not that relies on review instead of opinions.

Sure: I phrased that badly by 'go in' I meant 'go through the review
process', not commit out of hand.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-31 Thread Noah Misch
On Thu, May 30, 2013 at 02:58:26PM +0200, Andres Freund wrote:
  * Andres Freund (and...@2ndquadrant.com) wrote:
   But really, I am not at all concerned about some obscure values being
   returned, but about a read() not being successful..

 After a bit of standard perusing writing a single byte to the end of the
 file after the fallocate ought to make at least the reading guaranteed
 to be defined. If we did seek(last_byte); write(); posix_fallocate() we
 should even always have defined content. Yuck.

This portion of the posix_fallocate() specification requires the hoped-for
effect on subsequent read() calls:

  If the offset+ len is beyond the current file size, then posix_fallocate()
  shall adjust the file size to offset+ len. Otherwise, the file size shall
  not be changed.
  -- 
http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fallocate.html

When the file size increases, read()'s defined behavior switches from
returning short to retrieving zeros.  There's no need for an additional
write() to ensure that.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote:
 FWIW, posix' description about posix_fallocate() doesn't actually say
 *anything* about reading. The guarantee it makes is:
 If posix_fallocate() returns successfully, subsequent writes to the
 specified file data shall not fail due to the lack of free space on the
 file system storage media..

 http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html

 So we don't even know whether we can read. I think that means we need to
 zero the file anyway...

Surely this is undue pessimism.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 6:49 AM, Robert Haas wrote:

On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com wrote:

So we don't even know whether we can read. I think that means we need to
zero the file anyway...


Surely this is undue pessimism.


There have been many occasions where I've found the Linux kernel 
defining support for POSIX behavior with a NOP stub that basically says 
we should make this work one day.  I don't know whether the fallocate 
code is one of those or a fully implemented call.  Based on that 
history, until I see a reader that validates the resulting files are 
good I have to assume they're not.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 06:49:42 -0400, Robert Haas wrote:
 On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  FWIW, posix' description about posix_fallocate() doesn't actually say
  *anything* about reading. The guarantee it makes is:
  If posix_fallocate() returns successfully, subsequent writes to the
  specified file data shall not fail due to the lack of free space on the
  file system storage media..
 
  http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
 
  So we don't even know whether we can read. I think that means we need to
  zero the file anyway...
 
 Surely this is undue pessimism.

Why? The spec doesn't specify that case and that very well allows other
behaviour. Glibc sure does behave sensibly and zeroes the data
(sysdeps/posix/posix_fallocate64.c for the generic implementation) and
so does linux' fallocate() syscall, but that doesn't say much about
other implementations.

None of the manpages I could find, nor the spec says anything about the
file's contents in the extended range. Given there were at least three
manpages of different origins that didn't specify that behaviour I am
not too optimistic. Why they didn't specify that completely obvious
question is hard to understand from my pov.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 06:55:16 -0400, Greg Smith wrote:
 On 5/30/13 6:49 AM, Robert Haas wrote:
 On Wed, May 29, 2013 at 10:42 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 So we don't even know whether we can read. I think that means we need to
 zero the file anyway...
 
 Surely this is undue pessimism.
 
 There have been many occasions where I've found the Linux kernel defining
 support for POSIX behavior with a NOP stub that basically says we should
 make this work one day.  I don't know whether the fallocate code is one of
 those or a fully implemented call.  Based on that history, until I see a
 reader that validates the resulting files are good I have to assume they're
 not.

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?

The linux fallocate call is fully specified for this behaviour and got
added 2.6.23, there wasn't a stub before, so I am far less worried about
it than about the underspecifiedness of posix_fallocate(). Also, if some
system call doesn't follow its documented specifications it's not fully
our problem anymore. If we rely on undocumented behaviour though...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:17 AM, Andres Freund wrote:

That argument in contrast I find not very convincing though. What was
the last incidence of such a system call that did not just error out
with ENOTSUPP or such?


http://linux.die.net/man/2/posix_fadvise talks about  POSIX_FADV_NOREUSE 
and POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op, 
depending on your version.  I know there were more examples than just 
that one that popped up during the testing of effective_io_concurrency. 
 My starting position has to assume that posix_fallocate can have the 
same sort of surprising behavior that showed up repeatedly when we were 
trying to use posix_fadvise more aggressively.


The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same 
thing) was a similar issue, and that's the first one that left me 
forever skeptical of Linux kernel claims in this area until they are 
explicitly validated:  http://lwn.net/Articles/350225/


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 07:48:51 -0400, Greg Smith wrote:
 On 5/30/13 7:17 AM, Andres Freund wrote:
 That argument in contrast I find not very convincing though. What was
 the last incidence of such a system call that did not just error out
 with ENOTSUPP or such?
 
 http://linux.die.net/man/2/posix_fadvise talks about  POSIX_FADV_NOREUSE and
 POSIX_FADV_WILLNEED being both buggy and quietly mapped to a no-op,
 depending on your version.  I know there were more examples than just that
 one that popped up during the testing of effective_io_concurrency.  My
 starting position has to assume that posix_fallocate can have the same sort
 of surprising behavior that showed up repeatedly when we were trying to use
 posix_fadvise more aggressively.

Uh. How is that a correctness problem? fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.
I don't think that's a very meaningful comparison.

 The way O_SYNC was quietly mapped to O_DSYNC (which isn't the same thing)
 was a similar issue, and that's the first one that left me forever skeptical
 of Linux kernel claims in this area until they are explicitly validated:
 http://lwn.net/Articles/350225/

Yea, but that mistake is literally decades old...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 7:52 AM, Andres Freund wrote:

fadvise is a hint which is pretty
different from a fallocate where ignoring would have way much more
severe consequences.


Yes, it will.  That's why I want to see it tested.  There is more than 
enough past examples of bad behavior here to be skeptical that this sort 
of API may not work exactly as specified.  If you're willing to believe 
the spec, that's fine, but I think that's dangerously optimistic.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
 Surely this is undue pessimism.

 Why? The spec doesn't specify that case and that very well allows other
 behaviour. Glibc sure does behave sensibly and zeroes the data
 (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
 so does linux' fallocate() syscall, but that doesn't say much about
 other implementations.

 None of the manpages I could find, nor the spec says anything about the
 file's contents in the extended range. Given there were at least three
 manpages of different origins that didn't specify that behaviour I am
 not too optimistic. Why they didn't specify that completely obvious
 question is hard to understand from my pov.

I think they didn't specify it because it IS obvious.  As Stephen
says, it's been understood for decades that allowing unzeroed pages to
be reallocated to some other file is a major security hole.  I think
we can assume that no credible OS does that.  If there's some OS out
there that chooses to fill the pre-extended pages with 0x55 or cat
/dev/urandom instead of 0x00, they probably deserve what they get.
It's hard for me to be believe that anything that silly actually
exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:02:56 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 7:13 AM, Andres Freund and...@2ndquadrant.com wrote:
  Surely this is undue pessimism.
 
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
  None of the manpages I could find, nor the spec says anything about the
  file's contents in the extended range. Given there were at least three
  manpages of different origins that didn't specify that behaviour I am
  not too optimistic. Why they didn't specify that completely obvious
  question is hard to understand from my pov.
 
 I think they didn't specify it because it IS obvious.  As Stephen
 says, it's been understood for decades that allowing unzeroed pages to
 be reallocated to some other file is a major security hole.  I think
 we can assume that no credible OS does that.  If there's some OS out
 there that chooses to fill the pre-extended pages with 0x55 or cat
 /dev/urandom instead of 0x00, they probably deserve what they get.
 It's hard for me to be believe that anything that silly actually
 exists.

I don't think there's much danger of getting uninitialized data or
such. That clearly would be insane. I think somebody might interpret it
as read(2) returning an error until the page has been written to which
isn't completely crazy.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Peter Eisentraut
On 5/30/13 7:13 AM, Andres Freund wrote:
 Why? The spec doesn't specify that case and that very well allows other
 behaviour. Glibc sure does behave sensibly and zeroes the data
 (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
 so does linux' fallocate() syscall, but that doesn't say much about
 other implementations.

glibc actually only writes one byte to every file system block, to make
sure the block is allocated.  It doesn't actually zero every byte.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Peter Eisentraut
On 5/30/13 8:02 AM, Robert Haas wrote:
 If there's some OS out
 there that chooses to fill the pre-extended pages with 0x55 or cat
 /dev/urandom instead of 0x00, they probably deserve what they get.

Even that wouldn't be a problem for our purpose.  The only problem would
be if you can't read from the allocated region at all.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
 On 5/30/13 8:02 AM, Robert Haas wrote:
  If there's some OS out
  there that chooses to fill the pre-extended pages with 0x55 or cat
  /dev/urandom instead of 0x00, they probably deserve what they get.
 
 Even that wouldn't be a problem for our purpose.  The only problem would
 be if you can't read from the allocated region at all.

Well, only as long as we only use it for preallocation of wal files. I
am much, much more interested in doing that for the heap. And there that
surely would be a problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:17:28 -0400, Peter Eisentraut wrote:
 On 5/30/13 7:13 AM, Andres Freund wrote:
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
 glibc actually only writes one byte to every file system block, to make
 sure the block is allocated.  It doesn't actually zero every byte.

Which is fine since that guarantees we can read from those areas... And
unless I misremember something that actually guarantees that the rest of
the data is initialized to zero as well. Yes: subsequent reads of data
in the gap shall return bytes with the value 0 until data is actually
written into the gap.

But really, I am not at all concerned about some obscure values being
returned, but about a read() not being successful..

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 5/30/13 7:13 AM, Andres Freund wrote:
  Why? The spec doesn't specify that case and that very well allows other
  behaviour. Glibc sure does behave sensibly and zeroes the data
  (sysdeps/posix/posix_fallocate64.c for the generic implementation) and
  so does linux' fallocate() syscall, but that doesn't say much about
  other implementations.
 
 glibc actually only writes one byte to every file system block, to make
 sure the block is allocated.  It doesn't actually zero every byte.

That goes back to the 'sane implementation' question..  Is there a case
where that would actually be different from writing zeros for the entire
block..?  Is there some OS that gives you random data for the 'hole'
when you write a byte, seek to the start of the next block and then
write another byte?  That actually *would* be against what's documented
and required by spec, no?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2013-05-30 08:19:17 -0400, Peter Eisentraut wrote:
  On 5/30/13 8:02 AM, Robert Haas wrote:
   If there's some OS out
   there that chooses to fill the pre-extended pages with 0x55 or cat
   /dev/urandom instead of 0x00, they probably deserve what they get.
  
  Even that wouldn't be a problem for our purpose.  The only problem would
  be if you can't read from the allocated region at all.
 
 Well, only as long as we only use it for preallocation of wal files. I
 am much, much more interested in doing that for the heap. And there that
 surely would be a problem.

Yes, that was my thinking as well.  If posix_fallocate is faster than
writing out 8K of zeros, and the block can immediately be read as if it
had actually been written to, then I'd be very interested in using it to
extend heap files.  As I mentioned in this thread (or perhaps it was
another), I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 But really, I am not at all concerned about some obscure values being
 returned, but about a read() not being successful..

Alright, so what do we need to do to test this?  We really just need a
short C program written up and then a bunch of folks to run it on
various architectures, right?  Gee, sounds like what the buildfarm was
made for (alright, alright, PostgreSQL isn't exactly a 'short C
program', but you get the idea).  As I recall, Andrew reworked the
buildfarm code to be more modular too..  Anyone have thoughts about how
we could run these kinds of tests with it?  Or do people think that's a
bad idea?

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Andres Freund
On 2013-05-30 08:53:37 -0400, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  But really, I am not at all concerned about some obscure values being
  returned, but about a read() not being successful..
 
 Alright, so what do we need to do to test this?  We really just need a
 short C program written up and then a bunch of folks to run it on
 various architectures, right?

After a bit of standard perusing writing a single byte to the end of the
file after the fallocate ought to make at least the reading guaranteed
to be defined. If we did seek(last_byte); write(); posix_fallocate() we
should even always have defined content. Yuck.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 After a bit of standard perusing writing a single byte to the end of the
 file after the fallocate ought to make at least the reading guaranteed
 to be defined. If we did seek(last_byte); write(); posix_fallocate() we
 should even always have defined content. Yuck.

Alright, but would that actually be any better than just doing what
glibc's posix_fallocate() does in the generic case?  And, to be honest,
it makes me a bit nervous to seek/write like that because it looks like
the typical create a hole setup, which we certainly aren't intending,
yet if the posix_fallocate() call disappeared, or did nothing, or this
code was copied w/o it, or someone didn't understand what it did, we
could end up with that.

Not a fan. :(

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 8:50 AM, Stephen Frost wrote:

I don't think this solves the locking issue around the
relation extention lock, but it might help some and it may make it
easier to tweak that logic to allocate larger chunks in the future.


It might make it a bit faster, but it doesn't make it any easier to 
implement.  The messy part of extending relations in larger chunks is 
how to communicate that back into the buffer manager usefully.  The 
extension path causing trouble is RelationGetBufferForTuple calling 
ReadBufferBI.  All of that is passing a single buffer around.  There's 
no simple way I can see to rewrite it to handle more than one at a time.


I have a test case for relation extension that I'm going to package up 
soon.  That makes it easy to see where the problem is at.  Far as I can 
tell the reason it hasn't been fixed before now is that it's a pain to 
write the code.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Robert Haas
On Thu, May 30, 2013 at 8:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 I don't think there's much danger of getting uninitialized data or
 such. That clearly would be insane. I think somebody might interpret it
 as read(2) returning an error until the page has been written to which
 isn't completely crazy.

In the absence of tangible evidence of some implementation that
behaves that way, I think that's just paranoia.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Alvaro Herrera
Greg Smith escribió:

 The messy part of extending relations in larger chunks
 is how to communicate that back into the buffer manager usefully.
 The extension path causing trouble is RelationGetBufferForTuple
 calling ReadBufferBI.  All of that is passing a single buffer
 around.  There's no simple way I can see to rewrite it to handle
 more than one at a time.

No, but we can have it create several pages and insert them into the
FSM.  So they aren't returned to the original caller but are available
to future users.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-30 Thread Greg Smith

On 5/30/13 11:21 AM, Alvaro Herrera wrote:

Greg Smith escribió:


The messy part of extending relations in larger chunks
is how to communicate that back into the buffer manager usefully.
The extension path causing trouble is RelationGetBufferForTuple
calling ReadBufferBI.  All of that is passing a single buffer
around.  There's no simple way I can see to rewrite it to handle
more than one at a time.


No, but we can have it create several pages and insert them into the
FSM.  So they aren't returned to the original caller but are available
to future users.


There's actually a code comment wondering about this topic for the pages 
that are already created, in src/backend/access/heap/hio.c :


Remember the new page as our target for future insertions.
XXX should we enter the new page into the free space map immediately, or 
just keep it for this backend's exclusive use in the short run (until 
VACUUM sees it)?  Seems to depend on whether you expect the current 
backend to make more insertions or not, which is probably a good bet 
most of the time.  So for now, don't add it to FSM yet.


We have to be careful about touching too much at that particular point, 
because it's holding a relation extension lock at the obvious spot to 
make a change.


There's an interesting overlap with these questions about how files are 
extended too, with this comment in that file too, just before the above:


XXX This does an lseek - rather expensive - but at the moment it is the 
only way to accurately determine how many blocks are in a relation.  Is 
it worth keeping an accurate file length in shared memory someplace, 
rather than relying on the kernel to do it for us?


That whole sequence of code took the easy way forward when it was 
written, but it's obvious the harder one (also touching the FSM) was 
considered even then.  The whole sequence needs to be revisited to pull 
off multiple page extension.  I wouldn't say it's hard, but it's enough 
work that I haven't been able to find a block of time to go through the 
whole thing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-29 Thread Peter Eisentraut
On 5/28/13 11:36 AM, Greg Smith wrote:
 Outside of the run for performance testing, I think it would be good at
 this point to validate that there is really a 16MB file full of zeroes
 resulting from these operations.  I am not really concerned that
 posix_fallocate might be slower in some cases; that seems unlikely.  I
 am concerned that it might result in a file that isn't structurally the
 same as the 16MB of zero writes implementation used now.

I see nothing in the posix_fallocate() man pages that says that the
allocated space is filled with any kind of data or zeroes.  It will
likely be garbage data, but that should be fine for a new WAL file.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-29 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 5/28/13 11:36 AM, Greg Smith wrote:
  Outside of the run for performance testing, I think it would be good at
  this point to validate that there is really a 16MB file full of zeroes
  resulting from these operations.  I am not really concerned that
  posix_fallocate might be slower in some cases; that seems unlikely.  I
  am concerned that it might result in a file that isn't structurally the
  same as the 16MB of zero writes implementation used now.
 
 I see nothing in the posix_fallocate() man pages that says that the
 allocated space is filled with any kind of data or zeroes.  It will
 likely be garbage data, but that should be fine for a new WAL file.

I *really* hope that the Linux kernel, and other, folks are smart enough
to realize that they can't just re-use random blocks from an I/O device
without cleaning it first.  That would be one massive security hole.  I
expect posix_fallocate() actually works more like spase files, except
that it also counts the space as being 'taken', but it doesn't go out
and actually pull blocks to use until you actually go to write to it.
At which point, perhaps there's an optimization that says if the first
thing done with this is writing, then just write out whatever data is
requested and then fill the rest of the block out with zeros, and a
similar read operation which says if we havn't formally assigned a
block for this, just return zeros.  Hopefully it's smart enough to
avoid writing out all zeros and then turning around and writing out
whatever data is given, though since it'd all be in memory, perhaps
that wouldn't be too bad and might be simpler to implement.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-29 Thread Andres Freund
On 2013-05-29 10:36:07 -0400, Stephen Frost wrote:
 * Peter Eisentraut (pete...@gmx.net) wrote:
  On 5/28/13 11:36 AM, Greg Smith wrote:
   Outside of the run for performance testing, I think it would be good at
   this point to validate that there is really a 16MB file full of zeroes
   resulting from these operations.  I am not really concerned that
   posix_fallocate might be slower in some cases; that seems unlikely.  I
   am concerned that it might result in a file that isn't structurally the
   same as the 16MB of zero writes implementation used now.
  
  I see nothing in the posix_fallocate() man pages that says that the
  allocated space is filled with any kind of data or zeroes.  It will
  likely be garbage data, but that should be fine for a new WAL file.
 
 I *really* hope that the Linux kernel, and other, folks are smart enough
 to realize that they can't just re-use random blocks from an I/O device
 without cleaning it first.

FWIW, posix' description about posix_fallocate() doesn't actually say
*anything* about reading. The guarantee it makes is:
If posix_fallocate() returns successfully, subsequent writes to the
specified file data shall not fail due to the lack of free space on the
file system storage media..

http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html

So we don't even know whether we can read. I think that means we need to
zero the file anyway...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-29 Thread Peter Eisentraut
On 5/29/13 10:42 AM, Andres Freund wrote:
 On 2013-05-29 10:36:07 -0400, Stephen Frost wrote:
 I *really* hope that the Linux kernel, and other, folks are smart enough
 to realize that they can't just re-use random blocks from an I/O device
 without cleaning it first.
 
 FWIW, posix' description about posix_fallocate() doesn't actually say
 *anything* about reading. The guarantee it makes is:
 If posix_fallocate() returns successfully, subsequent writes to the
 specified file data shall not fail due to the lack of free space on the
 file system storage media..
 
 http://pubs.opengroup.org/onlinepubs/009696799/functions/posix_fallocate.html
 
 So we don't even know whether we can read. I think that means we need to
 zero the file anyway...

We could use Linux fallocate(), which does guarantee that the file reads
back as zeroes.  Or we use posix_fallocate() and write over the first
few bytes, enough for a subsequent reader to detect that it shouldn't
read any further.

But all of this is getting very complicated for such a marginal improvement.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Robert Haas
On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 The biggest thing missing from this submission is information about what
 performance testing you did.  Ideally performance patches are submitted with
 enough information for a reviewer to duplicate the same test the author did,
 as well as hard before/after performance numbers from your test system.  It
 often turns tricky to duplicate a performance gain, and being able to run
 the same test used for initial development eliminates a lot of the problems.

 This has been a bit of a struggle. While it's true that WAL file
 creation doesn't happen with great frequency, and while it's also true
 that - with strace and other tests - it can be proven that
 fallocate(16MB) is much quicker than writing it zeroes by hand,
 proving that in the larger context of a running install has been
 challenging.

It's nice to be able to test things in the context of a running
install, but sometimes a microbenchmark is just as good.  I mean, if
posix_fallocate() is faster, then it's just faster, right?  It's
likely to be pretty hard to get reproducible numbers for how much this
actually helps in the real world because write tests are inherently
pretty variable depending on a lot of factors we don't control, so
even if Jon has got the best possible test, the numbers may bounce
around so much that you can't really measure the (probably small) gain
from this approach.  But that doesn't seem like a reason not to adopt
the approach and take whatever gain there is.  At least, not that I
can see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Andres Freund
On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
 On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
  The biggest thing missing from this submission is information about what
  performance testing you did.  Ideally performance patches are submitted 
  with
  enough information for a reviewer to duplicate the same test the author 
  did,
  as well as hard before/after performance numbers from your test system.  It
  often turns tricky to duplicate a performance gain, and being able to run
  the same test used for initial development eliminates a lot of the 
  problems.
 
  This has been a bit of a struggle. While it's true that WAL file
  creation doesn't happen with great frequency, and while it's also true
  that - with strace and other tests - it can be proven that
  fallocate(16MB) is much quicker than writing it zeroes by hand,
  proving that in the larger context of a running install has been
  challenging.
 
 It's nice to be able to test things in the context of a running
 install, but sometimes a microbenchmark is just as good.  I mean, if
 posix_fallocate() is faster, then it's just faster, right?

Well, it's a bit more complex than that. Fallocate doesn't actually
initializes the disk space in most filesystems, just marks it as
allocated and zeroed which is one of the reasons it can be noticeably
faster. But that can make the runtime overhead of writing to those pages
higher.

I wonder whether noticeably upping checkpoint segments and then
a) COPY in a large table
b) a pgbench on a previously initialized table.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Robert Haas
On Tue, May 28, 2013 at 10:15 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
 On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:
  The biggest thing missing from this submission is information about what
  performance testing you did.  Ideally performance patches are submitted 
  with
  enough information for a reviewer to duplicate the same test the author 
  did,
  as well as hard before/after performance numbers from your test system.  
  It
  often turns tricky to duplicate a performance gain, and being able to run
  the same test used for initial development eliminates a lot of the 
  problems.
 
  This has been a bit of a struggle. While it's true that WAL file
  creation doesn't happen with great frequency, and while it's also true
  that - with strace and other tests - it can be proven that
  fallocate(16MB) is much quicker than writing it zeroes by hand,
  proving that in the larger context of a running install has been
  challenging.

 It's nice to be able to test things in the context of a running
 install, but sometimes a microbenchmark is just as good.  I mean, if
 posix_fallocate() is faster, then it's just faster, right?

 Well, it's a bit more complex than that. Fallocate doesn't actually
 initializes the disk space in most filesystems, just marks it as
 allocated and zeroed which is one of the reasons it can be noticeably
 faster. But that can make the runtime overhead of writing to those pages
 higher.

Maybe it would be good to measure that impact.  Something like this:

1. Write 16MB of zeroes to an empty file in the same size chunks we're
currently using (8kB?).  Time that.  Rewrite the file with real data.
Time that.
2. posix_fallocate() an empty file out to 16MB.  Time that.  Rewrite
the fie with real data.  Time that.

Personally, I have trouble believing that writing 16MB of zeroes by
hand is better than telling the OS to do it for us.  If that's so,
the OS is just stupid, because it ought to be able to optimize the
crap out of that compared to anything we can do.  Of course, it is
more than possible that the OS is in fact stupid.  But I'd like to
hope not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Jon Nelson
On Tue, May 28, 2013 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, May 28, 2013 at 10:15 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
 On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:
  The biggest thing missing from this submission is information about what
  performance testing you did.  Ideally performance patches are submitted 
  with
  enough information for a reviewer to duplicate the same test the author 
  did,
  as well as hard before/after performance numbers from your test system.  
  It
  often turns tricky to duplicate a performance gain, and being able to run
  the same test used for initial development eliminates a lot of the 
  problems.
 
  This has been a bit of a struggle. While it's true that WAL file
  creation doesn't happen with great frequency, and while it's also true
  that - with strace and other tests - it can be proven that
  fallocate(16MB) is much quicker than writing it zeroes by hand,
  proving that in the larger context of a running install has been
  challenging.

 It's nice to be able to test things in the context of a running
 install, but sometimes a microbenchmark is just as good.  I mean, if
 posix_fallocate() is faster, then it's just faster, right?

 Well, it's a bit more complex than that. Fallocate doesn't actually
 initializes the disk space in most filesystems, just marks it as
 allocated and zeroed which is one of the reasons it can be noticeably
 faster. But that can make the runtime overhead of writing to those pages
 higher.

 Maybe it would be good to measure that impact.  Something like this:

 1. Write 16MB of zeroes to an empty file in the same size chunks we're
 currently using (8kB?).  Time that.  Rewrite the file with real data.
 Time that.
 2. posix_fallocate() an empty file out to 16MB.  Time that.  Rewrite
 the fie with real data.  Time that.

 Personally, I have trouble believing that writing 16MB of zeroes by
 hand is better than telling the OS to do it for us.  If that's so,
 the OS is just stupid, because it ought to be able to optimize the
 crap out of that compared to anything we can do.  Of course, it is
 more than possible that the OS is in fact stupid.  But I'd like to
 hope not.

I wrote a little C program to do something very similar to that (which
I'll hope to post later today).
It opens a new file, fallocates 16MB, calls fdatasync.  Then it loops
10 times:  seek to the start of the file, writes 16MB of ones, calls
fdatasync.
Then it closes and removes the file, re-opens it, and this time writes
out 16MB of zeroes, calls fdatasync, and then does the same loop as
above. The program times the process from file open to file unlink,
inclusive.

The results - for me - are pretty consistent: using fallocate is
12-13% quicker than writing out zeroes. I used fdatasync twice to
(attempt) to mimic what the WAL writer does.

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Andres Freund
On 2013-05-28 10:12:05 -0500, Jon Nelson wrote:
 On Tue, May 28, 2013 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
  On Tue, May 28, 2013 at 10:15 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  On 2013-05-28 10:03:58 -0400, Robert Haas wrote:
  On Sat, May 25, 2013 at 2:55 PM, Jon Nelson jnelson+pg...@jamponi.net 
  wrote:
   The biggest thing missing from this submission is information about 
   what
   performance testing you did.  Ideally performance patches are 
   submitted with
   enough information for a reviewer to duplicate the same test the 
   author did,
   as well as hard before/after performance numbers from your test 
   system.  It
   often turns tricky to duplicate a performance gain, and being able to 
   run
   the same test used for initial development eliminates a lot of the 
   problems.
  
   This has been a bit of a struggle. While it's true that WAL file
   creation doesn't happen with great frequency, and while it's also true
   that - with strace and other tests - it can be proven that
   fallocate(16MB) is much quicker than writing it zeroes by hand,
   proving that in the larger context of a running install has been
   challenging.
 
  It's nice to be able to test things in the context of a running
  install, but sometimes a microbenchmark is just as good.  I mean, if
  posix_fallocate() is faster, then it's just faster, right?
 
  Well, it's a bit more complex than that. Fallocate doesn't actually
  initializes the disk space in most filesystems, just marks it as
  allocated and zeroed which is one of the reasons it can be noticeably
  faster. But that can make the runtime overhead of writing to those pages
  higher.
 
  Maybe it would be good to measure that impact.  Something like this:
 
  1. Write 16MB of zeroes to an empty file in the same size chunks we're
  currently using (8kB?).  Time that.  Rewrite the file with real data.
  Time that.
  2. posix_fallocate() an empty file out to 16MB.  Time that.  Rewrite
  the fie with real data.  Time that.
 
  Personally, I have trouble believing that writing 16MB of zeroes by
  hand is better than telling the OS to do it for us.  If that's so,
  the OS is just stupid, because it ought to be able to optimize the
  crap out of that compared to anything we can do.  Of course, it is
  more than possible that the OS is in fact stupid.  But I'd like to
  hope not.

 I wrote a little C program to do something very similar to that (which
 I'll hope to post later today).
 It opens a new file, fallocates 16MB, calls fdatasync.  Then it loops
 10 times:  seek to the start of the file, writes 16MB of ones, calls
 fdatasync.

You need to call fsync() not fdatasync() the first time round. fdatasync
doesn't guarantee metadata is synced.

 Then it closes and removes the file, re-opens it, and this time writes
 out 16MB of zeroes, calls fdatasync, and then does the same loop as
 above. The program times the process from file open to file unlink,
 inclusive.

 The results - for me - are pretty consistent: using fallocate is
 12-13% quicker than writing out zeroes.

Cool!

 I used fdatasync twice to (attempt) to mimic what the WAL writer does.

Not sure what you mean by that though?

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Greg Smith

On 5/28/13 11:12 AM, Jon Nelson wrote:

It opens a new file, fallocates 16MB, calls fdatasync.


Outside of the run for performance testing, I think it would be good at 
this point to validate that there is really a 16MB file full of zeroes 
resulting from these operations.  I am not really concerned that 
posix_fallocate might be slower in some cases; that seems unlikely.  I 
am concerned that it might result in a file that isn't structurally the 
same as the 16MB of zero writes implementation used now.


The timing program you're writing has some aspects that are similar to 
the contrib/pg_test_fsync program.  You might borrow some code from 
there usefully.


To clarify the suggestion I was making before about including 
performance test results:  that doesn't necessarily mean the testing 
code must run using only the database.  That's better if possible, but 
as Robert says it may not be for some optimizations.  The important 
thing is to have something measuring the improvement that a reviewer can 
duplicate, and if that's a standalone benchmark problem that's still 
very useful.  The main thing I'm wary of is any this should be faster 
claims that don't come with any repeatable measurements at all.  Very 
often theories about the fastest way to do something don't match what's 
actually seen in testing.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Jon Nelson
On Tue, May 28, 2013 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/28/13 11:12 AM, Jon Nelson wrote:

 It opens a new file, fallocates 16MB, calls fdatasync.


 Outside of the run for performance testing, I think it would be good at this
 point to validate that there is really a 16MB file full of zeroes resulting
 from these operations.  I am not really concerned that posix_fallocate might
 be slower in some cases; that seems unlikely.  I am concerned that it might
 result in a file that isn't structurally the same as the 16MB of zero writes
 implementation used now.

util-linux comes with fallocate which (might?) suffice for testing in
that respect, no?
If that is a real concern, it could be made part of the autoconf
testing, perhaps.

 The timing program you're writing has some aspects that are similar to the
 contrib/pg_test_fsync program.  You might borrow some code from there
 usefully.

Thanks! If it looks like what I'm attaching will not do, then I'll
look at that as a possible next step.

 To clarify the suggestion I was making before about including performance
 test results:  that doesn't necessarily mean the testing code must run using
 only the database.  That's better if possible, but as Robert says it may not
 be for some optimizations.  The important thing is to have something
 measuring the improvement that a reviewer can duplicate, and if that's a
 standalone benchmark problem that's still very useful.  The main thing I'm
 wary of is any this should be faster claims that don't come with any
 repeatable measurements at all.  Very often theories about the fastest way
 to do something don't match what's actually seen in testing.

Ack.
A note: The attached test program uses *fsync* instead of *fdatasync*
after calling fallocate (or writing out 16MB of zeroes), per an
earlier suggestion.

--
Jon
#include stdlib.h
#include stdio.h
#include fcntl.h
#include unistd.h
#include string.h
#include sys/time.h
#include sys/types.h

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

void writeout(int fd, char *buf)
{
	int i;
	for (i = 0; i  SIXTEENMB / EIGHTKB; ++i) {
		if (write(fd, buf, EIGHTKB) != EIGHTKB) {
			fprintf(stderr, Error in write: %m!\n);
			exit(1);
		}
	}
}

int main(int argc, char *argv[])
{
	int with_fallocate, open_close_iterations, rewrite_iterations;
	int fd, i, j;
	double tt;
	struct timeval tv1, tv2;
	char *buf0, *buf1;
	const char *filename; /* for convenience */

	if (argc != 4) {
		fprintf(stderr, Usage: %s filename open/close iterations rewrite iterations\n, argv[0]);
		exit(1);
	}
	
	filename = argv[1];

	open_close_iterations = atoi(argv[2]);
	if (open_close_iterations  0) {
		fprintf(stderr, Error parsing 'open_close_iterations'!\n);
		exit(1);
	}

	rewrite_iterations = atoi(argv[3]);
	if (rewrite_iterations  0) {
		fprintf(stderr, Error parsing 'rewrite_iterations'!\n);
		exit(1);
	}

	buf0 = malloc(SIXTEENMB);
	if (!buf0) {
		fprintf(stderr, Unable to allocate memory!\n);
		exit(1);
	}
	memset(buf0, 0, SIXTEENMB);

	buf1 = malloc(SIXTEENMB);
	if (!buf1) {
		fprintf(stderr, Unable to allocate memory!\n);
		exit(1);
	}
	memset(buf1, 1, SIXTEENMB);

	for (with_fallocate = 0;with_fallocate  2;++with_fallocate) {
		for (i = 0;i  open_close_iterations; ++i) {
			gettimeofday(tv1, NULL);
			fd = open(filename, O_CREAT | O_EXCL | O_WRONLY);
			if (fd  0) {
fprintf(stderr, Error opening file: %m\n);
exit(1);
			}
			if (with_fallocate) {
if (posix_fallocate(fd, 0, SIXTEENMB) != 0) {
	fprintf(stderr, Error in posix_fallocate!\n);
exit(1);
}
			} else {
writeout(fd, buf0);
			}
			if (fsync(fd)) {
fprintf(stderr, Error in fdatasync: %m!\n);
exit(1);
			}
			for (j = 0; j  rewrite_iterations; ++j) {
lseek(fd, 0, SEEK_SET);
writeout(fd, buf1);
if (fdatasync(fd)) {
	fprintf(stderr, Error in fdatasync: %m!\n);
	exit(1);
}
			}
			if (close(fd)) {
fprintf(stderr, Error in close: %m!\n);
exit(1);
			}
			unlink(filename);		/* don't check for error */
		}
		gettimeofday(tv2, NULL);
		tt = (tv2.tv_usec + tv2.tv_sec * 100) - (tv1.tv_usec + tv1.tv_sec * 100);
		tt /= 100;
		fprintf(stderr,
			with%s posix_fallocate: %d open/close iterations, %d rewrite in %0.4fs\n,
			(with_fallocate ?  : out), open_close_iterations, rewrite_iterations, tt);
		sleep(5);
	}
	/* cleanup */
	free(buf0);
	free(buf1);
	return 0;
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-28 Thread Greg Smith

On 5/28/13 10:00 PM, Jon Nelson wrote:

On Tue, May 28, 2013 at 10:36 AM, Greg Smith g...@2ndquadrant.com wrote:

On 5/28/13 11:12 AM, Jon Nelson wrote:


It opens a new file, fallocates 16MB, calls fdatasync.



Outside of the run for performance testing, I think it would be good at this
point to validate that there is really a 16MB file full of zeroes resulting
from these operations.  I am not really concerned that posix_fallocate might
be slower in some cases; that seems unlikely.  I am concerned that it might
result in a file that isn't structurally the same as the 16MB of zero writes
implementation used now.


util-linux comes with fallocate which (might?) suffice for testing in
that respect, no?
If that is a real concern, it could be made part of the autoconf
testing, perhaps.


I was just thinking of something to run in your test program, not 
another build time check.  Just run the new allocation sequence, and 
then check the resulting WAL file for a) correct length, and b) 16K of 
zero bytes.  I would like to build some confidence that posix_fallocate 
is operating correctly in this context on at least one platform.  My 
experience with Linux handling this class of functions correctly has 
left me skeptical of them working until that's proven to be the case.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-25 Thread Jon Nelson
On Thu, May 16, 2013 at 7:05 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 5/16/13 9:16 AM, Jon Nelson wrote:

 Am I doing this the right way? Should I be posting the full patch each
 time, or incremental patches?


 There are guidelines for getting your patch in the right format at
 https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git
 that would improve this one.  You have some formatting issues with tab
 spacing at lines 120 through 133 in your v3 patch.  And it looks like there
 was a formatting change on line 146 that is making the diff larger than it
 needs to be.

I've corrected the formatting change (end-of-line whitespace was
stripped) on line 146.
The other whitespace changes are - I think - due to newly-indented
code due to a new code block.
Included please find a v4 patch which uses context diffs per the above url.

 The biggest thing missing from this submission is information about what
 performance testing you did.  Ideally performance patches are submitted with
 enough information for a reviewer to duplicate the same test the author did,
 as well as hard before/after performance numbers from your test system.  It
 often turns tricky to duplicate a performance gain, and being able to run
 the same test used for initial development eliminates a lot of the problems.

This has been a bit of a struggle. While it's true that WAL file
creation doesn't happen with great frequency, and while it's also true
that - with strace and other tests - it can be proven that
fallocate(16MB) is much quicker than writing it zeroes by hand,
proving that in the larger context of a running install has been
challenging.

Attached you'll find a small test script (t.sh) which creates a new
cluster in 'foo', changes some config values, starts the cluster, and
then times how long it takes pgbench to prepare a database. I've used
wal_level = hot_standby in the hopes that this generates the largest
number of WAL files (and I set the number of such files to 1024). The
hardware is an AMD 9150e with a 2-disk software RAID1 (SATA disks) on
kernel 3.9.2 and ext4 (x86_64, openSUSE 12.3). The test results are
not that surprising. The longer the test (the larger the scale factor)
the less of a difference using posix_fallocate makes. With a scale
factor of 100, I see an average of 10-11% reduction in the time taken
to initialize the database. With 300, it's about 5.5% and with 900,
it's between 0 and 1.2%. I will be doing more testing but this is what
I started with. I'm very open to suggestions.

 Second bit of nitpicking.  There are already some GUC values that appear or
 disappear based on compile time options.  They're all debugging related
 things though.  I would prefer not to see this one go away when it's
 implementation isn't available.  That's going to break any scripts that SHOW
 the setting to see if it's turned on or not as a first problem.  I think the
 right model to follow here is the IFDEF setup used for
 effective_io_concurrency.  I wouldn't worry about this too much though.
 Having a wal_use_fallocate GUC is good for testing.  But if it works out
 well, when it's ready for commit I don't see why anyone would want it turned
 off on platforms where it works.  There are already too many performance
 tweaking GUCs.  Something has to be very likely to be changed from the
 default before its worth adding one for it.

Ack.  I've revised the patch to always have the GUC (for now), default
to false, and if configure can't find posix_fallocate (or the user
disables it by way of pg_config_manual.h) then it remains a GUC that
simply can't be changed.

I'll also be re-running the tests.

--
Jon
attachment: plot.png

fallocate-v4.patch
Description: Binary data


t.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-17 Thread Andres Freund
On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
  * Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?
 
 In my limited testing, I noticed a drop of approx. 100ms per WAL file.
 I do not have a good idea for how to really stress the WAL-file
 creation area without calling pg_start_backup and pg_stop_backup over
 and over (with archiving enabled).

My point is that wal file creation usually isn't all that performance
sensitive. Once the cluster has enough WAL files it will usually recycle
them and thus never allocate new ones. So for this to be really
beneficial it would be interesting to show different performance during
normal running. You could also check out of how many extents a wal file
is made out of with fallocate in comparison to the old style method
(filefrag will give you that for most filesystems).

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-17 Thread Merlin Moncure
On Fri, May 17, 2013 at 4:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
  * Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

 In my limited testing, I noticed a drop of approx. 100ms per WAL file.
 I do not have a good idea for how to really stress the WAL-file
 creation area without calling pg_start_backup and pg_stop_backup over
 and over (with archiving enabled).

 My point is that wal file creation usually isn't all that performance
 sensitive. Once the cluster has enough WAL files it will usually recycle
 them and thus never allocate new ones. So for this to be really
 beneficial it would be interesting to show different performance during
 normal running. You could also check out of how many extents a wal file
 is made out of with fallocate in comparison to the old style method
 (filefrag will give you that for most filesystems).

But why does it have to be *really* beneficial?  We're already making
optional posix_fxxx calls and fallocate seems to do exactly what we
would want in this context.  Even if the 100ms drop doesn't show up
all that often, I'd still take it just for the defragmentation
benefits and the patch is fairly tiny.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-17 Thread Merlin Moncure
On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, May 17, 2013 at 4:47 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
  * Is wal file creation performance actually relevant? Is the performance
of a system running on fallocate()d wal files any different?

 In my limited testing, I noticed a drop of approx. 100ms per WAL file.
 I do not have a good idea for how to really stress the WAL-file
 creation area without calling pg_start_backup and pg_stop_backup over
 and over (with archiving enabled).

 My point is that wal file creation usually isn't all that performance
 sensitive. Once the cluster has enough WAL files it will usually recycle
 them and thus never allocate new ones. So for this to be really
 beneficial it would be interesting to show different performance during
 normal running. You could also check out of how many extents a wal file
 is made out of with fallocate in comparison to the old style method
 (filefrag will give you that for most filesystems).

 But why does it have to be *really* beneficial?  We're already making
 optional posix_fxxx calls and fallocate seems to do exactly what we
 would want in this context.  Even if the 100ms drop doesn't show up
 all that often, I'd still take it just for the defragmentation
 benefits and the patch is fairly tiny.

Here is sample output of filefrag on a somewhat busy database from our
testing environment that exactly duplicates our production workloads..
 It does a lot of batch processing at night and a mix of 80%oltp 20%
olap during the day.  This is on ext3.  Interestingly, on ext4 servers
I never saw more than 2 extents per file (but those servers are mostly
not as busy).

[root@rpisatysw001 pg_xlog]# filefrag *
000106D20064: 490 extents found, perfection would be 1 extent
000106D20065: 33 extents found, perfection would be 1 extent
000106D20066: 43 extents found, perfection would be 1 extent
000106D20067: 71 extents found, perfection would be 1 extent
000106D20068: 43 extents found, perfection would be 1 extent
000106D20069: 156 extents found, perfection would be 1 extent
000106D2006A: 52 extents found, perfection would be 1 extent
000106D2006B: 108 extents found, perfection would be 1 extent

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-17 Thread Andres Freund
On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:
 On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure mmonc...@gmail.com wrote:
  On Fri, May 17, 2013 at 4:47 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
   * Is wal file creation performance actually relevant? Is the performance
 of a system running on fallocate()d wal files any different?
 
  In my limited testing, I noticed a drop of approx. 100ms per WAL file.
  I do not have a good idea for how to really stress the WAL-file
  creation area without calling pg_start_backup and pg_stop_backup over
  and over (with archiving enabled).
 
  My point is that wal file creation usually isn't all that performance
  sensitive. Once the cluster has enough WAL files it will usually recycle
  them and thus never allocate new ones. So for this to be really
  beneficial it would be interesting to show different performance during
  normal running. You could also check out of how many extents a wal file
  is made out of with fallocate in comparison to the old style method
  (filefrag will give you that for most filesystems).
 
  But why does it have to be *really* beneficial?  We're already making
  optional posix_fxxx calls and fallocate seems to do exactly what we
  would want in this context.  Even if the 100ms drop doesn't show up
  all that often, I'd still take it just for the defragmentation
  benefits and the patch is fairly tiny.

Well, it needs to be tested et al. And its a fairly critical code
path. I seem to remember that there were older glibc versions that
didn't do such a great job at emulating fallocate for example.

 Here is sample output of filefrag on a somewhat busy database from our
 testing environment that exactly duplicates our production workloads..
  It does a lot of batch processing at night and a mix of 80%oltp 20%
 olap during the day.  This is on ext3.  Interestingly, on ext4 servers
 I never saw more than 2 extents per file (but those servers are mostly
 not as busy).

Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
shutting down the cluster, copying the wal files in a moment where there
is enough free space. Just don't forget to sync afterwards.
EXT4 is notably better at allocating space in growing files than ext3
due to delayed allocation (and other things), so it wouldn't surprise me
similar differences in fragmentation even if the load were comparable.

Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
patch anyway.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-17 Thread Merlin Moncure
On Fri, May 17, 2013 at 4:18 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-05-17 15:48:38 -0500, Merlin Moncure wrote:
 On Fri, May 17, 2013 at 8:29 AM, Merlin Moncure mmonc...@gmail.com wrote:
  On Fri, May 17, 2013 at 4:47 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
  On 2013-05-15 16:46:33 -0500, Jon Nelson wrote:
   * Is wal file creation performance actually relevant? Is the 
   performance
 of a system running on fallocate()d wal files any different?
 
  In my limited testing, I noticed a drop of approx. 100ms per WAL file.
  I do not have a good idea for how to really stress the WAL-file
  creation area without calling pg_start_backup and pg_stop_backup over
  and over (with archiving enabled).
 
  My point is that wal file creation usually isn't all that performance
  sensitive. Once the cluster has enough WAL files it will usually recycle
  them and thus never allocate new ones. So for this to be really
  beneficial it would be interesting to show different performance during
  normal running. You could also check out of how many extents a wal file
  is made out of with fallocate in comparison to the old style method
  (filefrag will give you that for most filesystems).
 
  But why does it have to be *really* beneficial?  We're already making
  optional posix_fxxx calls and fallocate seems to do exactly what we
  would want in this context.  Even if the 100ms drop doesn't show up
  all that often, I'd still take it just for the defragmentation
  benefits and the patch is fairly tiny.

 Well, it needs to be tested et al. And its a fairly critical code
 path. I seem to remember that there were older glibc versions that
 didn't do such a great job at emulating fallocate for example.

 Here is sample output of filefrag on a somewhat busy database from our
 testing environment that exactly duplicates our production workloads..
  It does a lot of batch processing at night and a mix of 80%oltp 20%
 olap during the day.  This is on ext3.  Interestingly, on ext4 servers
 I never saw more than 2 extents per file (but those servers are mostly
 not as busy).

 Ok, that's pretty bad. 490 extents in one file? Really? I'd consider
 shutting down the cluster, copying the wal files in a moment where there
 is enough free space. Just don't forget to sync afterwards.
 EXT4 is notably better at allocating space in growing files than ext3
 due to delayed allocation (and other things), so it wouldn't surprise me
 similar differences in fragmentation even if the load were comparable.

 Ext3 doesn't have fallocate btw, so it wouldn't benefit from such a
 patch anyway.

yeah -- I see your point.  The object lesson isn't so much 'improve
postgres' as it is to 'use a modern filesystem'.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-16 Thread Jon Nelson
On Wed, May 15, 2013 at 10:36 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Jon Nelson escribió:
 On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:

  That's true. I originally wrote the patch using fallocate(2). What
  would be appropriate here? Should I switch on the return value and the
  six (6) or so relevant error codes?

 I addressed this, hopefully in a reasonable way.

 Would it work to just assign the value you got from posix_fallocate (if
 nonzero) to errno and then use %m in the errmsg() call in ereport()?

 That strikes me as a better way. I'll work something up soon.
 Thanks!

Please find attached version 3.
Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?


--
Jon


fallocate-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-16 Thread Alvaro Herrera
Jon Nelson escribió:

 Am I doing this the right way? Should I be posting the full patch each
 time, or incremental patches?

Full patch each time is okay.  Context-format patch is even better.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-16 Thread Greg Smith

On 5/16/13 9:16 AM, Jon Nelson wrote:

Am I doing this the right way? Should I be posting the full patch each
time, or incremental patches?


There are guidelines for getting your patch in the right format at 
https://wiki.postgresql.org/wiki/Working_with_Git#Context_diffs_with_Git 
that would improve this one.  You have some formatting issues with tab 
spacing at lines 120 through 133 in your v3 patch.  And it looks like 
there was a formatting change on line 146 that is making the diff larger 
than it needs to be.


The biggest thing missing from this submission is information about what 
performance testing you did.  Ideally performance patches are submitted 
with enough information for a reviewer to duplicate the same test the 
author did, as well as hard before/after performance numbers from your 
test system.  It often turns tricky to duplicate a performance gain, and 
being able to run the same test used for initial development eliminates 
a lot of the problems.


Second bit of nitpicking.  There are already some GUC values that appear 
or disappear based on compile time options.  They're all debugging 
related things though.  I would prefer not to see this one go away when 
it's implementation isn't available.  That's going to break any scripts 
that SHOW the setting to see if it's turned on or not as a first 
problem.  I think the right model to follow here is the IFDEF setup used 
for effective_io_concurrency.  I wouldn't worry about this too much 
though.  Having a wal_use_fallocate GUC is good for testing.  But if it 
works out well, when it's ready for commit I don't see why anyone would 
want it turned off on platforms where it works.  There are already too 
many performance tweaking GUCs.  Something has to be very likely to be 
changed from the default before its worth adding one for it.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Tue, May 14, 2013 at 9:43 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, May 13, 2013 at 9:54 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Pertinent to another thread titled
 [HACKERS] corrupt pages detected by enabling checksums
 I hope to explore the possibility of using fallocate (or
 posix_fallocate) for new WAL file creation.

 Most modern Linux filesystems support fast fallocate/posix_fallocate,
 reducing extent fragmentation (where extents are used) and frequently
 offering a pretty significant speed improvement. In my tests, using
 posix_fallocate (followed by pg_fsync) is at least 28 times quicker
 than using the current method (which writes zeroes followed by
 pg_fsync).

 I have written up a patch to use posix_fallocate in new WAL file
 creation, including configuration by way of a GUC variable, but I've
 not contributed to the PostgreSQL project before. Therefore, I'm
 fairly certain the patch is not formatted properly or conforms to the
 appropriate style guides. Currently, the patch is based on 9.2, and is
 quite small in size - 3.6KiB.

I have re-based and reformatted the code, and basic testing shows a
reduction in WAL-file creation time of a fairly significant amount.
I ran 'make test' and did additional local testing without issue.
Therefore, I am attaching the patch. I will try to add it to the
commitfest page.


--
Jon


0001-enhance-GUC-and-xlog-with-wal_use_fallocate-boolean-.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Andres Freund
Hi,

On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:
  I have written up a patch to use posix_fallocate in new WAL file
  creation, including configuration by way of a GUC variable, but I've
  not contributed to the PostgreSQL project before. Therefore, I'm
  fairly certain the patch is not formatted properly or conforms to the
  appropriate style guides. Currently, the patch is based on 9.2, and is
  quite small in size - 3.6KiB.
 
 I have re-based and reformatted the code, and basic testing shows a
 reduction in WAL-file creation time of a fairly significant amount.
 I ran 'make test' and did additional local testing without issue.
 Therefore, I am attaching the patch. I will try to add it to the
 commitfest page.

Some where quick comments, without thinking about this:

* needs a configure check for posix_fallocate. The current version will
  e.g. fail to compile on windows or many other non linux systems. Check
  how its done for posix_fadvise.
* Is wal file creation performance actually relevant? Is the performance
  of a system running on fallocate()d wal files any different?
* According to the man page posix_fallocate doesn't set errno but rather
  returns the error code.
* I wonder whether we ever want to actually disable this? Afair the libc
  contains emulation for posix_fadvise if the filesystem doesn't support
  it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 4:34 PM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2013-05-15 16:26:15 -0500, Jon Nelson wrote:
  I have written up a patch to use posix_fallocate in new WAL file
  creation, including configuration by way of a GUC variable, but I've
  not contributed to the PostgreSQL project before. Therefore, I'm
  fairly certain the patch is not formatted properly or conforms to the
  appropriate style guides. Currently, the patch is based on 9.2, and is
  quite small in size - 3.6KiB.

 I have re-based and reformatted the code, and basic testing shows a
 reduction in WAL-file creation time of a fairly significant amount.
 I ran 'make test' and did additional local testing without issue.
 Therefore, I am attaching the patch. I will try to add it to the
 commitfest page.

 Some where quick comments, without thinking about this:

Thank you for the kind feedback.

 * needs a configure check for posix_fallocate. The current version will
   e.g. fail to compile on windows or many other non linux systems. Check
   how its done for posix_fadvise.

I will address as soon as I am able.

 * Is wal file creation performance actually relevant? Is the performance
   of a system running on fallocate()d wal files any different?

In my limited testing, I noticed a drop of approx. 100ms per WAL file.
I do not have a good idea for how to really stress the WAL-file
creation area without calling pg_start_backup and pg_stop_backup over
and over (with archiving enabled).

However, a file allocated with fallocate is (supposed to be) less
fragmented than one created by the traditional means.

 * According to the man page posix_fallocate doesn't set errno but rather
   returns the error code.

That's true. I originally wrote the patch using fallocate(2). What
would be appropriate here? Should I switch on the return value and the
six (6) or so relevant error codes?

 * I wonder whether we ever want to actually disable this? Afair the libc
   contains emulation for posix_fadvise if the filesystem doesn't support
   it.

I know that glibc does, but I don't know about other libc implementations.

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 On Wed, May 15, 2013 at 4:34 PM, Andres Freund and...@2ndquadrant.com wrote:
..
 Some where quick comments, without thinking about this:

 Thank you for the kind feedback.

 * needs a configure check for posix_fallocate. The current version will
   e.g. fail to compile on windows or many other non linux systems. Check
   how its done for posix_fadvise.

The following patch includes the changes to configure.in.
I had to make other changes (not included here) because my local
system uses autoconf 2.69, but I did test this successfully.

 That's true. I originally wrote the patch using fallocate(2). What
 would be appropriate here? Should I switch on the return value and the
 six (6) or so relevant error codes?

I addressed this, hopefully in a reasonable way.

--
Jon


fallocate.patch-v2
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Alvaro Herrera
Jon Nelson escribió:
 On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:

  That's true. I originally wrote the patch using fallocate(2). What
  would be appropriate here? Should I switch on the return value and the
  six (6) or so relevant error codes?
 
 I addressed this, hopefully in a reasonable way.

Would it work to just assign the value you got from posix_fallocate (if
nonzero) to errno and then use %m in the errmsg() call in ereport()?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-15 Thread Jon Nelson
On Wed, May 15, 2013 at 10:17 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Jon Nelson escribió:
 On Wed, May 15, 2013 at 4:46 PM, Jon Nelson jnelson+pg...@jamponi.net 
 wrote:

  That's true. I originally wrote the patch using fallocate(2). What
  would be appropriate here? Should I switch on the return value and the
  six (6) or so relevant error codes?

 I addressed this, hopefully in a reasonable way.

 Would it work to just assign the value you got from posix_fallocate (if
 nonzero) to errno and then use %m in the errmsg() call in ereport()?

That strikes me as a better way. I'll work something up soon.
Thanks!

--
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-14 Thread Robert Haas
On Mon, May 13, 2013 at 9:54 PM, Jon Nelson jnelson+pg...@jamponi.net wrote:
 Pertinent to another thread titled
 [HACKERS] corrupt pages detected by enabling checksums
 I hope to explore the possibility of using fallocate (or
 posix_fallocate) for new WAL file creation.

 Most modern Linux filesystems support fast fallocate/posix_fallocate,
 reducing extent fragmentation (where extents are used) and frequently
 offering a pretty significant speed improvement. In my tests, using
 posix_fallocate (followed by pg_fsync) is at least 28 times quicker
 than using the current method (which writes zeroes followed by
 pg_fsync).

 I have written up a patch to use posix_fallocate in new WAL file
 creation, including configuration by way of a GUC variable, but I've
 not contributed to the PostgreSQL project before. Therefore, I'm
 fairly certain the patch is not formatted properly or conforms to the
 appropriate style guides. Currently, the patch is based on 9.2, and is
 quite small in size - 3.6KiB.

 Advice on how to proceed is appreciated.

Make sure to list it here:

https://commitfest.postgresql.org/action/commitfest_view/open

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fallocate / posix_fallocate for new WAL file creation (etc...)

2013-05-13 Thread David Fetter
On Mon, May 13, 2013 at 08:54:39PM -0500, Jon Nelson wrote:
 Pertinent to another thread titled
 [HACKERS] corrupt pages detected by enabling checksums
 I hope to explore the possibility of using fallocate (or
 posix_fallocate) for new WAL file creation.
 
 Most modern Linux filesystems support fast fallocate/posix_fallocate,
 reducing extent fragmentation (where extents are used) and frequently
 offering a pretty significant speed improvement. In my tests, using
 posix_fallocate (followed by pg_fsync) is at least 28 times quicker
 than using the current method (which writes zeroes followed by
 pg_fsync).
 
 I have written up a patch to use posix_fallocate in new WAL file
 creation, including configuration by way of a GUC variable, but I've
 not contributed to the PostgreSQL project before. Therefore, I'm
 fairly certain the patch is not formatted properly or conforms to the
 appropriate style guides. Currently, the patch is based on 9.2, and is
 quite small in size - 3.6KiB.
 
 Advice on how to proceed is appreciated.

Thanks for hopping in!

Please re-base the patch vs. git master, as new features like this go
there.  Please also to send along the tests you're doing so others can
riff.  Tests that find any weak points are also good.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers