Re: [HACKERS] posix_fadvise missing in the walsender

2013-03-01 Thread Florian Weimer
* Jeff Janes:

 Does the kernel really read a data block from disk into memory in
 order to immediately overwrite it?  I would have thought it would
 optimize that away, at least if the writes are sized and aligned to
 512- or 1024-byte blocks (which WAL should be).

With Linux, you'd have to use O_DIRECT to get that effect (but don't
do that). Otherwise, writes happen at page-size granularity, so writing
in 512- or 1024-byte blocks should really trigger a read-modify-write
cycle.
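
To make the granularity point concrete, here is a minimal sketch that only does the arithmetic: it reports whether a buffered write leaves some page partially covered, which is the case where a page that is not already cached may have to be read first. The function name is hypothetical, and the 4096-byte page size in the usage note is an assumption (the real value comes from sysconf(_SC_PAGESIZE)). It does not model what a particular kernel or filesystem actually does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a write of len bytes at offset starts or
 * ends in the middle of a page, so that at least one page is only
 * partially overwritten.  page_size must be non-zero.
 */
static bool
write_partially_covers_page(uint64_t offset, uint64_t len, uint64_t page_size)
{
    if (len == 0)
        return false;           /* nothing written, nothing to merge */

    return (offset % page_size) != 0 ||
           ((offset + len) % page_size) != 0;
}
```

With a 4096-byte page, a 512- or 1024-byte write always leaves a page partially covered, while a page-aligned write of 4096 or 8192 bytes does not.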


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-21 Thread Robert Haas
On Wed, Feb 20, 2013 at 9:49 PM, Joachim Wieland j...@mcknight.de wrote:
 So given the above, I think it's possible to come up with benchmarks
 that prove whatever you want to prove :-)

Yeah.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-21 Thread Jeff Janes
On Wed, Feb 20, 2013 at 7:54 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree with Merlin and Joachim - if we have the call in one place, we
 should have it in both.

 We might want to assess whether we even want to have it in one place.
 I've seen cases where the existing call hurts performance, because of
 WAL file recycling.  If we don't flush the WAL file blocks out of
 cache, then they're still there when we recycle the WAL file and we
 can overwrite them without further I/O.  But if we tell the OS to blow
 them away, then it has to reread them when we try to overwrite the old
 files, and so we stall waiting for the I/O.

Does the kernel really read a data block from disk into memory in
order to immediately overwrite it?  I would have thought it would
optimize that away, at least if the writes are sized and aligned to
512- or 1024-byte blocks (which WAL should be).  Well, stranger things
than that happen, I guess.  (For example on ext4, when a file with
dirty pages goes away due to another file getting renamed over the top
of it, the disappearing file automatically gets fsynced, or the
equivalent.)

Cheers,

Jeff


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-21 Thread Robert Haas
On Thu, Feb 21, 2013 at 12:16 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Does the kernel really read a data block from disk into memory in
 order to immediately overwrite it?  I would have thought it would
 optimize that away, at least if the writes are sized and aligned to
 512- or 1024-byte blocks (which WAL should be).

Now that you mention that I agree it seems strange, but that's what I saw.

/me scratches head

It does seem pretty odd, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-20 Thread Robert Haas
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree with Merlin and Joachim - if we have the call in one place, we
 should have it in both.

We might want to assess whether we even want to have it in one place.
I've seen cases where the existing call hurts performance, because of
WAL file recycling.  If we don't flush the WAL file blocks out of
cache, then they're still there when we recycle the WAL file and we
can overwrite them without further I/O.  But if we tell the OS to blow
them away, then it has to reread them when we try to overwrite the old
files, and so we stall waiting for the I/O.  I was able to clearly
measure this problem back when I was hacking on write scalability, so
it's not a purely hypothetical risk.

As for the proposed optimization, I tend to doubt that it's a good
idea.  We're talking about doing extra work to give the OS cache a
hint that may not be right anyway.  Color me skeptical...  but like
Heikki, I'm certainly willing to be proven wrong by some actual
benchmark results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-20 Thread Joachim Wieland
On Wed, Feb 20, 2013 at 4:54 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree with Merlin and Joachim - if we have the call in one place, we
 should have it in both.

 We might want to assess whether we even want to have it in one place.
 I've seen cases where the existing call hurts performance, because of
 WAL file recycling.

That's interesting, I hadn't thought about WAL recycling.

I now agree that this whole thing is even more complicated. You might
have an archive_command set as well (cp, for instance) that reads in
the WAL file again, possibly even right after we called posix_fadvise
on it.

It appears to me that the right strategy depends on a few factors:

a) what ratio of your active dataset fits into RAM?
b) how many WAL files do you have?
c) how long does it take for them to get recycled?
d) archive_command set / wal_senders active?

And recommendations for the two extremes would be:

If your dataset fits mostly into RAM and if you have only a few WAL
files that get recycled quickly then you don't want to evict the WAL
file from the buffer cache.
On the other hand if your dataset doesn't fit into RAM and you have
many WAL files that take a while until they get recycled, then you
should consider hinting to the OS.

If you're in that second category (I am) and you're also using the
archive_command you could just piggyback the posix_fadvise call onto
your archive_command, assuming that the walsender is already done with
the file at that moment. And I'm also pretty certain that Robert's
setup that he used for the write scalability tests fell into the first
category.
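
As a rough sketch of what piggybacking the hint onto archive_command could look like, the helper below opens a file and issues the same POSIX_FADV_DONTNEED call quoted from xlog.c. The function name is made up for illustration. It assumes the file has already been flushed to disk; on Linux, the posix_fadvise man page notes that dirty pages are not dropped by this advice.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* request the posix_fadvise declaration */
#endif
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that the cached pages of the file at path
 * are no longer needed.  Returns 0 on success, -1 if the file cannot be
 * opened, or the positive error number reported by posix_fadvise.
 */
static int
drop_file_from_cache(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;

    /* offset 0, length 0 means "the whole file" */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

Wrapped in a small main() that takes the path as its argument, this could be chained after the copy step of an archive_command. That is only safe under the assumption stated above, namely that the walsender is already done with the file at that moment.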

So given the above, I think it's possible to come up with benchmarks
that prove whatever you want to prove :-)


Joachim


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-19 Thread Merlin Moncure
On Mon, Feb 18, 2013 at 2:16 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 17.02.2013 14:55, Joachim Wieland wrote:

 In access/transam/xlog.c we give the OS buffer caching a hint that we
 won't need a WAL file any time soon with

  posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

 before closing the WAL file, but only if we don't have walsenders.
 That's reasonable because the walsender will reopen that same file
 shortly after.

 However the walsender doesn't call posix_fadvise once it's done with
 the file and I'm proposing to add this to walsender.c for consistency
 as well.

 Since there could be multiple walsenders, only the slowest one
 should call this function. Finding out the slowest walsender can be
 done by inspecting the shared memory and looking at the sentPtr of
 each walsender.


 I doubt it's worth it; the OS cache generally does a reasonable job of
 deciding what to keep. In the non-walsender case, it's pretty clear that we
 don't need the WAL file anymore, but if we need to work any harder than a
 one-line posix_fadvise call, meh.

If that's the case, why have the advisory call at all?  The OS is
being lied to (in some cases)...

merlin


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-19 Thread Simon Riggs
On 19 February 2013 20:19, Merlin Moncure mmonc...@gmail.com wrote:

 In access/transam/xlog.c we give the OS buffer caching a hint that we
 won't need a WAL file any time soon with

  posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);


 If that's the case, why have the advisory call at all?  The OS is
 being lied to (in some cases)...

I agree with Merlin and Joachim - if we have the call in one place, we
should have it in both.

This means that if a standby fails it will likely have to re-read
these files from disk. Cool, we can live with that.

Patch please,

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-19 Thread Joachim Wieland
On Tue, Feb 19, 2013 at 5:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 In access/transam/xlog.c we give the OS buffer caching a hint that we
 won't need a WAL file any time soon with

  posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);


 I agree with Merlin and Joachim - if we have the call in one place, we
 should have it in both.

You could argue that if it's considered beneficial in the case with no
walsender, then you should definitely have it if there are walsenders
around:
The walsenders reopen and read those files which gives the OS reason
to believe that other processes might do the same in the near future
and hence that it should not evict those pages too early.


Joachim


Re: [HACKERS] posix_fadvise missing in the walsender

2013-02-18 Thread Heikki Linnakangas

On 17.02.2013 14:55, Joachim Wieland wrote:

In access/transam/xlog.c we give the OS buffer caching a hint that we
won't need a WAL file any time soon with

 posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

before closing the WAL file, but only if we don't have walsenders.
That's reasonable because the walsender will reopen that same file
shortly after.

However the walsender doesn't call posix_fadvise once it's done with
the file and I'm proposing to add this to walsender.c for consistency
as well.

Since there could be multiple walsenders, only the slowest one
should call this function. Finding out the slowest walsender can be
done by inspecting the shared memory and looking at the sentPtr of
each walsender.


I doubt it's worth it; the OS cache generally does a reasonable job of
deciding what to keep. In the non-walsender case, it's pretty clear that
we don't need the WAL file anymore, but if we need to work any harder
than a one-line posix_fadvise call, meh. I could be convinced otherwise
with some performance test results, of course.


- Heikki


[HACKERS] posix_fadvise missing in the walsender

2013-02-17 Thread Joachim Wieland
In access/transam/xlog.c we give the OS buffer caching a hint that we
won't need a WAL file any time soon with

posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);

before closing the WAL file, but only if we don't have walsenders.
That's reasonable because the walsender will reopen that same file
shortly after.

However the walsender doesn't call posix_fadvise once it's done with
the file and I'm proposing to add this to walsender.c for consistency
as well.

Since there could be multiple walsenders, only the slowest one
should call this function. Finding out the slowest walsender can be
done by inspecting the shared memory and looking at the sentPtr of
each walsender.

Any comments?
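
The "slowest walsender" check described above could look roughly like the sketch below. The types and function names are hypothetical stand-ins, not PostgreSQL's shared-memory structures; only the idea of comparing each walsender's sentPtr comes from the proposal. Real code would also have to take whatever lock protects those slots, which is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's own types. */
typedef uint64_t WalPos;

typedef struct
{
    bool   active;      /* slot is in use by a running walsender */
    WalPos sent_ptr;    /* position up to which this walsender has sent WAL */
} SenderSlot;

/*
 * Find the smallest sent_ptr among the active slots.
 * Returns false (and leaves *out untouched) when no slot is active.
 */
static bool
slowest_sent_ptr(const SenderSlot *slots, size_t n, WalPos *out)
{
    bool   found = false;
    WalPos min = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!slots[i].active)
            continue;
        if (!found || slots[i].sent_ptr < min)
        {
            min = slots[i].sent_ptr;
            found = true;
        }
    }
    if (found)
        *out = min;
    return found;
}

/*
 * True when no active sender still has WAL before seg_end left to send,
 * so hinting the segment out of the cache would not hurt any of them.
 */
static bool
all_senders_past(const SenderSlot *slots, size_t n, WalPos seg_end)
{
    WalPos min;

    if (!slowest_sent_ptr(slots, n, &min))
        return true;            /* no walsenders at all */
    return min >= seg_end;
}
```

A walsender that has just finished a segment would call all_senders_past() with that segment's end position and issue the posix_fadvise hint only when it returns true.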

