Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-09 Thread Amit Kapila


 -----Original Message-----
 From: Robert Haas [mailto:robertmh...@gmail.com]
 Sent: Tuesday, April 09, 2013 9:28 AM
 To: Amit Kapila
 Cc: Greg Smith; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
 
 On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  I still have one more doubt; consider the below scenario for cases when
  we invalidate buffers during moving to freelist v/s just moving to
  freelist:
 
  A backend got the buffer from the freelist for a request of page-9
  (number 9 is random, just to explain); it still has an association with
  another page, page-10. It needs to add the buffer with the new tag (new
  page association) to the bufhash table and remove the buffer with the
  oldTag (old page association).
 
  The benefit of just moving to the freelist is that if we get a request
  for the same page before somebody else uses the buffer for another page,
  it will save a read I/O. However, on the other side, in many cases the
  backend will need an extra partition lock to remove the oldTag (which
  can lead to some bottleneck).
 
  I think saving read I/O is more beneficial, but I am just not sure that
  it is best, as the cases for it might be rare.
 
 I think saving read I/O is a lot more beneficial.  I haven't seen
 evidence of a severe bottleneck updating the buffer mapping tables.  I
 have seen some evidence of spinlock-level contention on read workloads
 that fit in shared buffers, because in that case the system can run
 fast enough for the spinlocks protecting the lwlocks to get pretty
 hot.  But if you're doing writes, or if the workload doesn't fit in
 shared buffers, other bottlenecks slow you down enough that this
 doesn't really seem to become much of an issue.
 
 Also, even if you *can* find some scenario where pushing the buffer
 invalidation into the background is a win, I'm not convinced that
 would justify doing it, because the case where it's a huge loss -
 namely, working set just a tiny bit smaller than shared_buffers - is
 pretty obvious. I don't think we dare fool around with that; the
 townspeople will arrive with pitchforks.
 
 I believe that the big win here is getting the clock sweep out of the
 foreground so that BufFreelistLock doesn't catch fire.  The buffer
 mapping locks are partitioned and, while it's not like that completely
 gets rid of the contention, it sure does help a lot.  So I would view
 that goal as primary, at least for now.  If we get a first round of
 optimization done in this area, that doesn't preclude further
 improving it in the future.

I agree with you that this can be a first step towards improvement.

  Last time, the following tests were executed to validate the results:
 
  Test suite - pgbench
  DB Size - 16 GB
  RAM - 24 GB
  Shared Buffers - 2G, 5G, 7G, 10G
  Concurrency - 8, 16, 32, 64 clients
  Pre-warm the buffers before start of test
 
  Shall we try any other scenarios, or are the above okay for an initial
  test of the patch?
 
 Seems like a reasonable place to start.

I shall work on this for the first CF of 9.4.


With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-08 Thread Robert Haas
On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila amit.kap...@huawei.com wrote:
 I still have one more doubt; consider the below scenario for cases when we
 invalidate buffers during moving to freelist v/s just moving to freelist:

 A backend got the buffer from the freelist for a request of page-9 (number
 9 is random, just to explain); it still has an association with another
 page, page-10. It needs to add the buffer with the new tag (new page
 association) to the bufhash table and remove the buffer with the oldTag
 (old page association).

 The benefit of just moving to the freelist is that if we get a request for
 the same page before somebody else uses the buffer for another page, it
 will save a read I/O. However, on the other side, in many cases the backend
 will need an extra partition lock to remove the oldTag (which can lead to
 some bottleneck).

 I think saving read I/O is more beneficial, but I am just not sure that it
 is best, as the cases for it might be rare.

I think saving read I/O is a lot more beneficial.  I haven't seen
evidence of a severe bottleneck updating the buffer mapping tables.  I
have seen some evidence of spinlock-level contention on read workloads
that fit in shared buffers, because in that case the system can run
fast enough for the spinlocks protecting the lwlocks to get pretty
hot.  But if you're doing writes, or if the workload doesn't fit in
shared buffers, other bottlenecks slow you down enough that this
doesn't really seem to become much of an issue.

Also, even if you *can* find some scenario where pushing the buffer
invalidation into the background is a win, I'm not convinced that
would justify doing it, because the case where it's a huge loss -
namely, working set just a tiny bit smaller than shared_buffers - is
pretty obvious. I don't think we dare fool around with that; the
townspeople will arrive with pitchforks.

I believe that the big win here is getting the clock sweep out of the
foreground so that BufFreelistLock doesn't catch fire.  The buffer
mapping locks are partitioned and, while it's not like that completely
gets rid of the contention, it sure does help a lot.  So I would view
that goal as primary, at least for now.  If we get a first round of
optimization done in this area, that doesn't preclude further
improving it in the future.

 Last time, the following tests were executed to validate the results:

 Test suite - pgbench
 DB Size - 16 GB
 RAM - 24 GB
 Shared Buffers - 2G, 5G, 7G, 10G
 Concurrency - 8, 16, 32, 64 clients
 Pre-warm the buffers before start of test

 Shall we try any other scenarios, or are the above okay for an initial test
 of the patch?

Seems like a reasonable place to start.

...Robert




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-05 Thread Robert Haas
On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila amit.kap...@huawei.com wrote:
 If we just put it on the freelist, then the next time it gets allocated
 directly from the bufhash table, who will remove it from the freelist?
 Or do you think that, in BufferAlloc, if it gets the buffer from the
 bufhash table, it should check whether it's in the freelist and, if so,
 remove it?

No, I don't think that's necessary.  We already have the following
guard in StrategyGetBuffer:

    if (buf->refcount == 0 && buf->usage_count == 0)
    {
        if (strategy != NULL)
            AddBufferToRing(strategy, buf);
        return buf;
    }

If a buffer is allocated from the freelist and it turns out that it
actually has a non-zero reference count or a non-zero usage count, we
just discard it and pull the next buffer off the freelist instead.
So, in the scenario you describe, the buffer gets reallocated (due to
a non-NULL BufferAccessStrategy, presumably) and then somebody comes
along and pulls it off the freelist.  But, since the buffer has just
been used by someone else, it'll most likely be pinned or have a
non-zero usage count, so we'll just skip it and allocate some other
buffer instead.  No harm done.

Now, it is possible that the buffer could get added to the freelist,
then allocated via a BufferAccessStrategy, and then the clock sweep
could hit it and push the usage count back to 0.  But that's no big
deal either: if we go to put it on the freelist and see (via
buf->freeNext) that it's already there, we can just leave it where it
is (or maybe move it to the end).  On a related note, we probably need
a variant of StrategyFreeBuffer which pushes buffers onto the end of
the freelist rather than the front.  It makes sense to stick
invalidated buffers on the front of the list (which is what
StrategyFreeBuffer does), but non-invalidated buffers should be placed
at the end to more closely approximate LRU.
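
For illustration, a minimal sketch of such a tail-push variant, using the
freelist fields from the freelist.c of that era (firstFreeBuffer,
lastFreeBuffer, freeNext); the function name and exact locking here are
assumptions, not a tested patch:

    /*
     * Sketch of a StrategyFreeBuffer variant that appends to the tail of
     * the freelist instead of pushing onto the head, so non-invalidated
     * buffers approximate LRU order.  Illustrative only.
     */
    static void
    StrategyFreeBufferTail(volatile BufferDesc *buf)
    {
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

        /* Leave the list alone if the buffer is already somewhere in it. */
        if (buf->freeNext == FREENEXT_NOT_IN_LIST)
        {
            buf->freeNext = FREENEXT_END_OF_LIST;
            if (StrategyControl->firstFreeBuffer < 0)
            {
                /* List was empty: this buffer becomes both head and tail. */
                StrategyControl->firstFreeBuffer = buf->buf_id;
            }
            else
            {
                /* Link the old tail to this buffer. */
                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext =
                    buf->buf_id;
            }
            StrategyControl->lastFreeBuffer = buf->buf_id;
        }

        LWLockRelease(BufFreelistLock);
    }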

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-05 Thread Amit Kapila
On Saturday, April 06, 2013 12:38 AM Robert Haas wrote:
 On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  If we just put it on the freelist, then the next time it gets allocated
  directly from the bufhash table, who will remove it from the freelist?
  Or do you think that, in BufferAlloc, if it gets the buffer from the
  bufhash table, it should check whether it's in the freelist and, if so,
  remove it?
 
 No, I don't think that's necessary.  We already have the following
 guard in StrategyGetBuffer:
 
     if (buf->refcount == 0 && buf->usage_count == 0)
     {
         if (strategy != NULL)
             AddBufferToRing(strategy, buf);
         return buf;
     }
 
 If a buffer is allocated from the freelist and it turns out that it
 actually has a non-zero reference count or a non-zero usage count, we
 just discard it and pull the next buffer off the freelist instead.
 So, in the scenario you describe, the buffer gets reallocated (due to
 a non-NULL BufferAccessStrategy, presumably) and then somebody comes
 along and pulls it off the freelist.  But, since the buffer has just
 been used by someone else, it'll most likely be pinned or have a
 non-zero usage count, so we'll just skip it and allocate some other
 buffer instead.  No harm done.

Yes, you are right; I had missed that part of the code while thinking of this
scenario. But I was talking about a NULL BufferAccessStrategy as well.

I still have one more doubt; consider the below scenario for cases when we
invalidate buffers during moving to freelist v/s just moving to freelist:

A backend got the buffer from the freelist for a request of page-9 (number 9
is random, just to explain); it still has an association with another page,
page-10. It needs to add the buffer with the new tag (new page association)
to the bufhash table and remove the buffer with the oldTag (old page
association).

The benefit of just moving to the freelist is that if we get a request for
the same page before somebody else uses the buffer for another page, it will
save a read I/O. However, on the other side, in many cases the backend will
need an extra partition lock to remove the oldTag (which can lead to some
bottleneck).

I think saving read I/O is more beneficial, but I am just not sure that it is
best, as the cases for it might be rare.
 
 Now, it is possible that the buffer could get added to the freelist,
 then allocated via a BufferAccessStrategy, and then the clock sweep
 could hit it and push the usage count back to 0.  But that's no big
 deal either: if we go to put it on the freelist and see (via
 buf->freeNext) that it's already there, we can just leave it where it
 is (or maybe move it to the end).  On a related note, we probably need
 a variant of StrategyFreeBuffer which pushes buffers onto the end of
 the freelist rather than the front.  It makes sense to stick
 invalidated buffers on the front of the list (which is what
 StrategyFreeBuffer does), but non-invalidated buffers should be placed
 at the end to more closely approximate LRU.

Okay.

Last time, the following tests were executed to validate the results:

Test suite - pgbench
DB Size - 16 GB 
RAM - 24 GB
Shared Buffers - 2G, 5G, 7G, 10G
Concurrency - 8, 16, 32, 64 clients
Pre-warm the buffers before start of test

Shall we try any other scenarios, or are the above okay for an initial test
of the patch?


With Regards,
Amit Kapila.










Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Amit Kapila
On Thursday, April 04, 2013 7:19 AM Greg Smith wrote:
 On 4/2/13 11:54 AM, Robert Haas wrote:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't
 have
  to wait for linear scans of the buffer array.
 
 I was hoping this one would make it to a full six years of being on the
 TODO list before it came up again, missed it by a few weeks.  The
 funniest part is that Amit even submitted a patch on this theme a few
 months ago without much feedback:
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
   That stalled where a few things have, on a) needing more regression
 test workloads, and b) wondering just what the deal with large
 shared_buffers setting degrading performance was.

For b), below are links to results where performance decreased due to large
shared buffers.

http://www.postgresql.org/message-id/attachment/27489/Results.htm
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285442C5@szxeml509-mbx


As per my observation, it occurs when I/O starts. The dip could be due to
fluctuation, or to OS scheduling, or to eviction of dirty pages sooner than
would otherwise happen.

I think further investigation would be more meaningful if the results were
reproduced by someone other than me.

One idea to proceed along this line could be to start with this patch and
then, based on the results, do further experiments to make it more useful.

With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Robert Haas
On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 4/2/13 11:54 AM, Robert Haas wrote:
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.

 I was hoping this one would make it to a full six years of being on the TODO
 list before it came up again, missed it by a few weeks.  The funniest part
 is that Amit even submitted a patch on this theme a few months ago without
 much feedback:
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
 That stalled where a few things have, on a) needing more regression test
 workloads, and b) wondering just what the deal with large shared_buffers
 setting degrading performance was.

Those are impressive results.  I think we should seriously consider
doing something like that for 9.4.  TBH, although more workloads to
test is always better, I don't think this problem is so difficult that
we can't have some confidence in a theoretical analysis.  If I read
the original thread correctly (and I haven't looked at the patch
itself), the proposed patch would actually invalidate buffers before
putting them on the freelist.  That effectively amounts to reducing
shared_buffers, so workloads that are just on the edge of what can fit
in shared_buffers will be harmed, and those that benefit incrementally
from increased shared_buffers will be as well.

What I think we should do instead is collect the buffers that we think
are evictable and stuff them onto the freelist without invalidating
them.  When a backend allocates from the freelist, it can double-check
that the buffer still has usage_count 0.  The odds should be pretty
good.  But even if we sometimes notice that the buffer has been
touched again after being put on the freelist, we haven't expended all
that much extra effort, and that effort happened mostly in the
background.  Consider a scenario where only 10% of the buffers have
usage count 0 (which is not unrealistic).  We scan 5000 buffers and
put 500 on the freelist.  Now suppose that, due to some accident of
the workload, 75% of those buffers get touched again before they're
allocated off the freelist (which I believe to be a pessimistic
estimate for most workloads).  Now, that means that only 125 of those
500 buffers will succeed in satisfying an allocation request.  That's
still a huge win, because it means that each backend only has to examine
an average of 4 buffers before it finds one to allocate.  If it had
needed to do the freelist scan itself, it would have had to touch 40
buffers before finding one to allocate.
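
As a rough sketch of the allocation side of this scheme (names follow the
freelist.c of that era, but the function itself and the elided locking are
assumptions for illustration):

    /*
     * Illustrative sketch: pop buffers from a bgwriter-populated freelist,
     * rechecking each one since it was pushed without being invalidated.
     * BufFreelistLock / buffer-header locking is elided for clarity.
     */
    static volatile BufferDesc *
    GetBufferFromFreelist(BufferAccessStrategy strategy)
    {
        while (StrategyControl->firstFreeBuffer >= 0)
        {
            volatile BufferDesc *buf =
                &BufferDescriptors[StrategyControl->firstFreeBuffer];

            /* Unlink from the freelist. */
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            /*
             * The buffer may have been touched again since it was listed;
             * if so, skip it, just as StrategyGetBuffer does today.
             */
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        return NULL;    /* list ran dry: caller falls back to the sweep */
    }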

In real life, I think the gains are apt to be, if anything, larger.
IME, it's common for most or all of the buffer pool to be pinned at
usage count 5.  So you could easily have a situation where the arena
scan has to visit millions of buffers to find one to allocate.  If
that's happening in the background instead of the foreground, it's a
huge win.  Also, note that there's nothing to prevent the arena scan
from happening in parallel with allocations off of the freelist - so
while foreground processes are emptying the freelist, the background
process can be looking for more things to add to it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Amit Kapila
On Thursday, April 04, 2013 6:12 PM Robert Haas wrote:
 On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith g...@2ndquadrant.com
 wrote:
  On 4/2/13 11:54 AM, Robert Haas wrote:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't
 have
  to wait for linear scans of the buffer array.
 
  I was hoping this one would make it to a full six years of being on
 the TODO
  list before it came up again, missed it by a few weeks.  The funniest
 part
  is that Amit even submitted a patch on this theme a few months ago
 without
  much feedback:
  http://www.postgresql.org/message-
 id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
  That stalled where a few things have, on a) needing more regression
 test
  workloads, and b) wondering just what the deal with large
 shared_buffers
  setting degrading performance was.
 
 Those are impressive results.  I think we should seriously consider
 doing something like that for 9.4.  TBH, although more workloads to
 test is always better, I don't think this problem is so difficult that
 we can't have some confidence in a theoretical analysis.  If I read
 the original thread correctly (and I haven't looked at the patch
 itself), the proposed patch would actually invalidate buffers before
 putting them on the freelist.  That effectively amounts to reducing
 shared_buffers, so workloads that are just on the edge of what can fit
 in shared_buffers will be harmed, and those that benefit incrementally
 from increased shared_buffers will be as well.
 
 What I think we should do instead is collect the buffers that we think
 are evictable and stuff them onto the freelist without invalidating
 them.  When a backend allocates from the freelist, it can double-check
 that the buffer still has usage_count 0.  The odds should be pretty
 good.  But even if we sometimes notice that the buffer has been
 touched again after being put on the freelist, we haven't expended all
 that much extra effort, and that effort happened mostly in the
 background.  

If we just put it on the freelist, then the next time it gets allocated
directly from the bufhash table, who will remove it from the freelist?
Or do you think that, in BufferAlloc, if it gets the buffer from the bufhash
table, it should check whether it's in the freelist and, if so, remove it?

With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Robert Haas
On Tue, Apr 2, 2013 at 1:20 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
  I agree in general, though I'm not sure the bgwriter process can
  reasonably handle this need along with what it's already supposed to be
  doing.  We may need another background process that is just responsible
  for keeping the freelist populated.

  What else is the bgwriter actually doing otherwise? Sure, it doesn't put
  the pages on the freelist, but otherwise it's trying to do the above,
  isn't it?

 I think it will be problematic to tie buffer-undirtying to putting both
 clean and dirty buffers into the freelist.  It might chance to work all
 right to use one scan process for both, but I'm afraid it's more likely
 that we'd end up either overserving one goal or underserving the other.

 Hm. I had imagined that we would only ever put clean buffers into the
 freelist and that we would never write out a buffer that we don't need
 for a new page. I don't see much point in randomly writing out buffers
 that won't be needed for something else soon. Currently we can't do much
 better than basically undirtying random buffers since we don't really know
 which page will reach a usagecount of zero since bgwriter doesn't
 manipulate usagecounts.

 One other scenario I can see is the problem of strategy buffer reusage
 of dirtied pages (hint bits, pruning) during seqscans where we would
 benefit from pages being written out fast, but I can't imagine that that
 could be handled very well by something like the bgwriter?

 Am I missing something?

I've had the same thought.  I think we should consider having a
background process that listens on a queue, sort of like the fsync
absorption queue.  When a process using a buffer access strategy
dirties a buffer, it adds it to that queue and sets the latch for the
background process, which then wakes up and starts cleaning the
buffers that have been added to its queue.  The hope is that, by the
time the ring buffer wraps around, the background process will have
cleaned the buffer, preventing the foreground process from having to
wait for the buffer write (and, perhaps, xlog flush).
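
A sketch of the enqueue side of such a scheme; the shared structure, its
name, and the overflow handling (elided) are all assumptions for
illustration, with only the latch and spinlock primitives being real APIs:

    #define CLEANER_QUEUE_SIZE 1024     /* illustrative */

    /* Hypothetical shared queue for buffers dirtied via a strategy. */
    typedef struct BufCleanerShmem
    {
        Latch      *cleanerLatch;   /* latch of the background cleaner */
        slock_t     mutex;          /* protects head */
        int         head;           /* next insertion slot */
        int         requests[CLEANER_QUEUE_SIZE];   /* buffer ids to clean */
    } BufCleanerShmem;

    static void
    RequestBufferClean(BufCleanerShmem *q, int buf_id)
    {
        SpinLockAcquire(&q->mutex);
        q->requests[q->head % CLEANER_QUEUE_SIZE] = buf_id;
        q->head++;                  /* overflow handling elided */
        SpinLockRelease(&q->mutex);

        /* Wake the cleaner so the buffer is clean before the ring wraps. */
        SetLatch(q->cleanerLatch);
    }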

The main hesitation I've had about actually implementing such a scheme
is that I find it a bit unappealing to have a background process
dedicated to just this.  But maybe it could be combined with some of
the other ideas presented here.  Perhaps we should have one process
that scans the buffer arena and populates the freelists; as a side
effect, if it runs across a dirty buffer, it kicks it over to the
process described in the previous paragraph (which could still, also,
absorb requests from other backends using buffer access strategies).
Then we'd end up with nothing that looks exactly like the background
writer we have now, but maybe no one would miss it.

I think that as we go through the process of trying to improve this,
we should also look hard at trying to make the algorithms more
self-tuning.  For example, instead of having a fixed delay between
rounds for the buffer-arena-scanning process, I think we should try to
make it adaptive.  If it sticks a bunch of buffers on the freelist and
the freelist then runs dry before it wakes up again, the backend that
notices that the list is empty (or below some low watermark) should
set a latch to wake up the buffer-arena-scanning process; and
the next time that process goes back to sleep, it should sleep for a
shorter period of time.  As things are today, what the background
writer actually does is unhelpful enough that there might not be much
point in fiddling with this, but as we get to having a more sensible
scheme overall, I think it will pay dividends.
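
A sketch of that adaptive delay; the helper name (PopulateFreelist) and the
halving/backoff constants are illustrative assumptions:

    static void
    FreelistPopulatorMain(void)
    {
        long    sleep_ms = 200;     /* initial delay between rounds */

        for (;;)
        {
            int     rc;

            PopulateFreelist();     /* hypothetical: one sweep round */

            rc = WaitLatch(&MyProc->procLatch,
                           WL_LATCH_SET | WL_TIMEOUT, sleep_ms);
            ResetLatch(&MyProc->procLatch);

            if (rc & WL_LATCH_SET)
            {
                /* A backend hit the low watermark and woke us: speed up. */
                sleep_ms = Max(sleep_ms / 2, 10);
            }
            else
            {
                /* Timed out without pressure: relax gradually. */
                sleep_ms = Min(sleep_ms + 10, 1000);
            }
        }
    }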

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Greg Stark
On Wed, Apr 3, 2013 at 3:00 PM, Robert Haas robertmh...@gmail.com wrote:

 The main hesitation I've had about actually implementing such a scheme
 is that I find it a bit unappealing to have a background process
 dedicated to just this.  But maybe it could be combined with some of
 the other ideas presented here.  Perhaps we should have one process
 that scans the buffer arena and populates the freelists; as a side
 effect, if it runs across a dirty buffer, it kicks it over to the
 process described in the previous paragraph (which could still, also,
 absorb requests from other backends using buffer access strategies).
 Then we'd end up with nothing that looks exactly like the background
 writer we have now, but maybe no one would miss it.


I think the general pattern of development has led in the opposite
direction. Every time there's been one daemon responsible for two things,
it's ended up causing contorted code and difficult-to-tune behaviours, and
we've ended up splitting the two.

In particular in this case it seems like an especially poor choice. In the
time one buffer write might take, the entire freelist might empty out. I
could easily imagine this happening *every* time a write I/O happens. It
seems more likely that you'll need multiple processes running the buffer
cleaning to keep up with a decent I/O subsystem.

I'm still skeptical about the idea of a freelist. That just seems like a
terrible point of contention. However perhaps that's because I'm picturing
an LRU linked list. Perhaps the right thing is to maintain a pool of
buffers in some less contention-prone data structure which lets each
backend pick buffers out more or less independently of the others.

-- 
greg


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 I'm still skeptical about the idea of a freelist. That just seems like a
 terrible point of contention. However perhaps that's because I'm picturing
 an LRU linked list. Perhaps the right thing is to maintain a pool of
 buffers in some less contention-prone data structure which lets each
 backend pick buffers out more or less independently of the others.

I think the original vision of the clock sweep algorithm included the
idea that different backends could be running the sweep over different
parts of the buffer ring concurrently.  If we could get rid of the need
to serialize that activity, it'd pretty much eliminate the bottleneck
I should think.  The problem is how to manage it to ensure that (a)
backends aren't actually contending on the same buffers as they do this,
and (b) there's a reasonably fair rate of usage_count decrements for
each buffer, rather than possibly everybody ganging up on the same area
sometimes.  Thoughts?

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Ants Aasma
On Thu, Apr 4, 2013 at 3:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think the original vision of the clock sweep algorithm included the
 idea that different backends could be running the sweep over different
 parts of the buffer ring concurrently.  If we could get rid of the need
 to serialize that activity, it'd pretty much eliminate the bottleneck
 I should think.  The problem is how to manage it to ensure that (a)
 backends aren't actually contending on the same buffers as they do this,
 and (b) there's a reasonably fair rate of usage_count decrements for
 each buffer, rather than possibly everybody ganging up on the same area
 sometimes.  Thoughts?

Fairness and avoiding each other imply that some coordination is
required. One wild idea I had is to partition the buffers into N
partitions, each with its own clock sweep protected by an lwlock.

To reduce contention, the clocksweep runners use something like
sched_getcpu() to determine which partition to use to find their
buffer. Using the CPU to determine the partition makes it necessary
for the process to be scheduled out in the critical section for some
other backend to contend on the lock. And if some backend does contend
on it, it is likely to reside on the same CPU and by sleeping will
make room for the lockholder to run.

To ensure fairness for buffers, every time one of the clocksweeps
wraps around, a global offset counter is incremented. This re-assigns
all cpus/backends to the next partition, sort of like the mad hatter's
tea party.
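
A sketch of that partition choice; the partition count, the sweepOffset
field, and the fallback are illustrative assumptions (sched_getcpu() is
glibc-specific, from <sched.h> with _GNU_SOURCE):

    #define N_SWEEP_PARTITIONS 16

    /* Pick the clock-sweep partition for this backend's next allocation. */
    static int
    MySweepPartition(void)
    {
        int     cpu = sched_getcpu();   /* may return -1 on error */

        /* Torn/stale reads of the shared offset are harmless here. */
        uint32  offset = StrategyControl->sweepOffset;  /* hypothetical */

        if (cpu < 0)
            cpu = MyProcPid;            /* fall back to something stable */

        return (cpu + offset) % N_SWEEP_PARTITIONS;
    }

    /* Called by whichever sweep wraps around: shifts everyone over by one. */
    static void
    AdvanceSweepOffset(void)
    {
        StrategyControl->sweepOffset++; /* exactness not required */
    }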

The scenario that I'm most worried about here is what happens when a
process holding the clocksweep lock is migrated to another CPU and
then scheduled out. The processes on the original CPU will sooner or
later block behind the lock while the processes on the CPU where the
lock holder now waits keep hogging the CPU. This will create an
imbalance that the scheduler might try to correct, possibly creating a
nasty feedback loop. It could be that in practice the scenario works
out to be too far fetched to matter, but who knows.

I don't have a slightest idea yet how the background writer would
function in this environment. But if redesigning the bgwriter
mechanism was on the table I thought I would throw the idea out here.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Greg Smith

On 4/2/13 11:54 AM, Robert Haas wrote:

But, having said that, I still think the best idea is what Andres
proposed, which pretty much matches my own thoughts: the bgwriter
needs to populate the free list, so that buffer allocations don't have
to wait for linear scans of the buffer array.


I was hoping this one would make it to a full six years of being on the 
TODO list before it came up again, missed it by a few weeks.  The 
funniest part is that Amit even submitted a patch on this theme a few 
months ago without much feedback: 
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs 
 That stalled where a few things have, on a) needing more regression 
test workloads, and b) wondering just what the deal with large 
shared_buffers setting degrading performance was.


I saw refactoring in this area as waiting behind it being easier to 
experiment with adding new processes, but that barrier has fallen now. 
Maybe it needs a new freelist process, maybe it doesn't; today the code 
needed to try both is relatively cheap.


The other thing that always seemed to stop me was never having a typical 
Linux system big enough to hit some of these problems available all the 
time.  What I did this week on that front was just go buy a 24 core 
server with 64GB of RAM that lives in my house.  I just need to keep it 
two floors away if I want to sleep at night.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
 On 3/23/13 7:41 AM, Ants Aasma wrote:
 Yes, having bgwriter do the actual cleaning up seems like a good idea.
 The whole bgwriter infrastructure will need some serious tuning. There
 are many things that could be shifted to background if we knew it
 could keep up, like hint bit setting on dirty buffers being flushed
 out. But again, we have the issue of having good tests to see where
 the changes hurt.
 
 I think at some point we need to stop depending on just bgwriter for all this 
 stuff. I believe it would be much cleaner if we had separate procs for 
 everything we needed (although some synergies might exist; if we wanted to 
 set hint bits during write then bgwriter *is* the logical place to put that).
 
 In this case, I don't think keeping stuff on the free list is close enough to 
 checkpoints that we'd want bgwriter to handle both. At most we might want 
 them to pass some metrics back and forth.

bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.

In my personal experience and measurement, bgwriter is pretty close to
useless right now. I think - pretty similar to what Amit has done - it
should perform part of a real clock sweep (instead of just looking ahead
of the current position without changing usagecounts or the sweep
position) and put enough buffers on the freelist to sustain the need till
its next activity phase. I hacked around that one night in a hotel and
got impressive speedups (and quite some breakage) for bigger-than-s_b
workloads.

That would reduce quite a few pain points:
- fewer different processes/cpus looking at buffer headers ahead in the cycle
- fewer cpus changing usagecounts
- dirty pages are far more likely to be flushed out already when a new
  page is needed
- stuff like the relation extension lock, which right now frequently has
  to search far and wide for new pages while holding the extension lock
  exclusively, should finish quite a bit faster

If the freelist lock is separated from the lock protecting the clock
sweep this should get quite a bit of a scalability boost without having
potential unfairness you can have with partitioning the lock or such.
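
A sketch of what that bgwriter-side sweep could look like; ClockSweepNext(),
the target logic, and the use of FlushBuffer are assumptions here, and
buffer-header locking is elided:

    /*
     * Illustrative only: run the real clock sweep from the bgwriter,
     * decrementing usage counts and pushing clean, unused buffers onto
     * the freelist until a target count is met or we give up.
     */
    static void
    RefillFreelist(int target)
    {
        int     added = 0;
        int     scanned = 0;

        while (added < target && scanned++ < NBuffers)
        {
            volatile BufferDesc *buf = ClockSweepNext();  /* moves the hand */

            if (buf->refcount != 0)
                continue;           /* pinned: leave it alone */

            if (buf->usage_count > 0)
                buf->usage_count--; /* the real sweep decrements */
            else
            {
                if (buf->flags & BM_DIRTY)
                    FlushBuffer(buf, NULL);     /* write it out first */
                StrategyFreeBuffer(buf);        /* clean and unused */
                added++;
            }
        }
    }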

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 6:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
 On 3/23/13 7:41 AM, Ants Aasma wrote:
 Yes, having bgwriter do the actual cleaning up seems like a good idea.
 The whole bgwriter infrastructure will need some serious tuning. There
 are many things that could be shifted to background if we knew it
 could keep up, like hint bit setting on dirty buffers being flushed
 out. But again, we have the issue of having good tests to see where
 the changes hurt.

 I think at some point we need to stop depending on just bgwriter for all 
 this stuff. I believe it would be much cleaner if we had separate procs for 
 everything we needed (although some synergies might exist; if we wanted to 
 set hint bits during write then bgwriter *is* the logical place to put that).

 In this case, I don't think keeping stuff on the free list is close enough 
 to checkpoints that we'd want bgwriter to handle both. At most we might want 
  them to pass some metrics back and forth.

 bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.

 In my personal experience and measurement, bgwriter is pretty close to
 useless right now. I think - pretty similar to what Amit has done - it
 should perform part of a real clock sweep (instead of just looking ahead
 of the current position without changing usagecounts or the sweep
 position) and put enough buffers on the freelist to sustain the need till
 its next activity phase. I hacked around that one night in a hotel and
 got impressive speedups (and quite some breakage) for bigger-than-s_b
 workloads.

 That would reduce quite a few pain points:
 - fewer different processes/cpus looking at buffer headers ahead in the cycle
 - fewer cpus changing usagecounts
 - dirty pages are far more likely to be flushed out already when a new
   page is needed
 - stuff like the relation extension lock, which right now frequently has
   to search far and wide for new pages while holding the extension lock
   exclusively, should finish quite a bit faster

 If the freelist lock is separated from the lock protecting the clock
 sweep this should get quite a bit of a scalability boost without having
 potential unfairness you can have with partitioning the lock or such.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That seems pretty unlikely because of (a) the sheer luck of hitting that
 page for the dropout (if your buffer count is N, the chances of losing
 it would seem to be 1/N at most) and (b) highly used pages are much more
 likely to be pinned and thus immune from eviction.  But my issue with
 this whole line of analysis is that I've never been able to turn it
 up in simulated testing.  Probably to do it you'd need very, very fast
 storage.

Well, if you have shared_buffers=8GB, that's a million buffers.  One
in a million events happen pretty frequently on a heavily loaded
server, which, on recent versions of PostgreSQL, can support several
hundred thousand queries per second, each of which accesses multiple
buffers.

I've definitely seen evidence that poor choices of which CLOG buffer
to evict can result in a noticeable system-wide stall while everyone
waits for it to be read back in.  I don't have any similar evidence
for shared buffers, but I wouldn't be very surprised if the same
danger exists there, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Merlin Moncure
On Tue, Apr 2, 2013 at 9:55 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That seems pretty unlikely because of (a) the sheer luck of hitting that
 page for the dropout (if your buffer count is N, the chances of losing
 it would seem to be 1/N at most) and (b) highly used pages are much more
 likely to be pinned and thus immune from eviction.  But my issue with
 this whole line of analysis is that I've never been able to turn it
 up in simulated testing.  Probably to do it you'd need very, very fast
 storage.

 Well, if you have shared_buffers=8GB, that's a million buffers.  One
 in a million events happen pretty frequently on a heavily loaded
 server, which, on recent versions of PostgreSQL, can support several
 hundred thousand queries per second, each of which accesses multiple
 buffers.

 I've definitely seen evidence that poor choices of which CLOG buffer
 to evict can result in a noticeable system-wide stall while everyone
 waits for it to be read back in.  I don't have any similar evidence
 for shared buffers, but I wouldn't be very surprised if the same
 danger exists there, too.

That's a very fair point, although not being able to evict pinned
buffers is a highly mitigating aspect.  Also, CLOG is a different beast
entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
can hold a lot of high-priority things.  But you could be right anyway.

Given that, I wouldn't feel very comfortable with forced eviction
without knowing for sure high priority buffers were immune from that.
Your nailing idea is maybe the ideal solution.   Messing around with
the usage_count mechanic is tempting (like raising the cap and making
the sweeper more aggressive as it iterates), but probably really
difficult to get right, and, hopefully, ultimately moot.

 merlin




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That's a very fair point, although not being able to evict pinned
 buffers is a highly mitigating aspect.  Also, CLOG is a different beast
 entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
 can hold a lot of high-priority things.  But you could be right anyway.

 Given that, I wouldn't feel very comfortable with forced eviction
 without knowing for sure high priority buffers were immune from that.
 Your nailing idea is maybe the ideal solution.   Messing around with
 the usage_count mechanic is tempting (like raising the cap and making
 the sweeper more aggressive as it iterates), but probably really
 difficult to get right, and, hopefully, ultimately moot.

One thought I had for fiddling with usage_count is to make it grow
additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
sure the idea is any good, but one problem with the current system is
that it's pretty trivial for a buffer to accumulate five touches, and
after that we lose all memory of what the frequency of access is, so
pages of widely varying levels of hotness can all have usage count 5.
This might allow a little more refinement without letting the time to
degrade the usage count get out of control.
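
In code form, the idea is just the following (the cap value is an
assumption; today both paths clamp at 5):

    #define MAX_USAGE_COUNT 64      /* illustrative cap */

    /* On buffer access: grow additively (today: usage_count++, capped). */
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;

    /* On each clock-sweep pass: decay exponentially (today: usage_count--). */
    buf->usage_count >>= 1;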

But, having said that, I still think the best idea is what Andres
proposed, which pretty much matches my own thoughts: the bgwriter
needs to populate the free list, so that buffer allocations don't have
to wait for linear scans of the buffer array.  That's just plain too
slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.  That's just plain too
 slow.

I agree in general, though I'm not sure the bgwriter process can
reasonably handle this need along with what it's already supposed to be
doing.  We may need another background process that is just responsible
for keeping the freelist populated.

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 11:54:32 -0400, Robert Haas wrote:
 On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
  That's a very fair point, although not being able to evict pinned
  buffers is a highly mitigating aspect.  Also, CLOG is a different beast
  entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
  can hold a lot of high-priority things.  But you could be right anyway.
 
  Given that, I wouldn't feel very comfortable with forced eviction
  without knowing for sure high priority buffers were immune from that.
  Your nailing idea is maybe the ideal solution.   Messing around with
  the usage_count mechanic is tempting (like raising the cap and making
  the sweeper more aggressive as it iterates), but probably really
  difficult to get right, and, hopefully, ultimately moot.
 
 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.
 
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.  That's just plain too
 slow.

That way the usagecount should go down far more slowly, which essentially
makes it more granular. And I think fiddling on that level before we have
a more sensible buffer acquisition implementation is pretty premature,
since that will change too much.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't have
  to wait for linear scans of the buffer array.  That's just plain too
  slow.
 
 I agree in general, though I'm not sure the bgwriter process can
 reasonably handle this need along with what it's already supposed to be
 doing.  We may need another background process that is just responsible
 for keeping the freelist populated.

What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
pages on the freelist, but otherwise it's trying to do the above, isn't it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
 I agree in general, though I'm not sure the bgwriter process can
 reasonably handle this need along with what it's already supposed to be
 doing.  We may need another background process that is just responsible
 for keeping the freelist populated.

 What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
 pages on the freelist, but otherwise it's trying to do the above, isn't it?

I think it will be problematic to tie buffer-undirtying to putting both
clean and dirty buffers into the freelist.  It might chance to work all
right to use one scan process for both, but I'm afraid it's more likely
that we'd end up either overserving one goal or underserving the other.

Also note the entire design of BgBufferSync right now is predicated on
the assumption that the rate of motion of the scan strategy point
reflects the rate at which new buffers are needed.  If backends are
supposed to always get new buffers from the freelist, that logic becomes
circular: the bgwriter would be watching itself.  Perhaps we can
refactor the feedback control loop logic so that the buffer scan rate is
driven by changes in the length of the freelist, but I'm not sure
exactly what the consequences would be.

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
  I agree in general, though I'm not sure the bgwriter process can
  reasonably handle this need along with what it's already supposed to be
  doing.  We may need another background process that is just responsible
  for keeping the freelist populated.
 
  What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
  pages on the freelist, but otherwise it's trying to do the above, isn't it?
 
 I think it will be problematic to tie buffer-undirtying to putting both
 clean and dirty buffers into the freelist.  It might chance to work all
 right to use one scan process for both, but I'm afraid it's more likely
 that we'd end up either overserving one goal or underserving the other.

Hm. I had imagined that we would only ever put clean buffers into the
freelist and that we would never write out a buffer that we don't need
for a new page. I don't see much point in randomly writing out buffers
that won't be needed for something else soon. Currently we can't do much
better than basically undirtying random buffers since we don't really know
which page will reach a usagecount of zero since bgwriter doesn't
manipulate usagecounts.

One other scenario I can see is the problem of strategy buffer reusage
of dirtied pages (hint bits, pruning) during seqscans where we would
benefit from pages being written out fast, but I can't imagine that that
could be handled very well by something like the bgwriter?

Am I missing something?

 Also note the entire design of BgBufferSync right now is predicated on
 the assumption that the rate of motion of the scan strategy point
 reflects the rate at which new buffers are needed.  If backends are
 supposed to always get new buffers from the freelist, that logic becomes
 circular: the bgwriter would be watching itself.  Perhaps we can
 refactor the feedback control loop logic so that the buffer scan rate is
 driven by changes in the length of the freelist, but I'm not sure
 exactly what the consequences would be.

Yea, that will definitely need refactoring. What I am imagining is that
the pacing is basically built on top of a few StrategyControl variables
like:
* number of buffers from the freelist
* number of buffers acquired by backend because freelist was empty
* number of buffers written out by backend because freelist was empty

Those should be pretty cheap to maintain and should be enough to
implement sensible pacing for bgwriter.
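
In rough form (field names are illustrative; rough, unlocked counts would
suffice for pacing):

    /* Sketch of pacing counters added to the strategy control area. */
    typedef struct StrategyPacingCounters
    {
        uint32      freelist_allocs;    /* handed out from the freelist */
        uint32      sweep_allocs;       /* list empty: backend ran sweep */
        uint32      backend_writes;     /* list empty: backend wrote page */
    } StrategyPacingCounters;

The bgwriter would read the deltas each cycle and scale its sweep rate
accordingly.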

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Greg Stark
I'm confused by this thread. We *used* to maintain an LRU. The whole
reason for the clock-sweep algorithm is precisely to avoid maintaining
a linked list of least recently used buffers since the head of that
list is a point of contention.




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 18:26:23 +0100, Greg Stark wrote:
 I'm confused by this thread. We *used* to maintain an LRU. The whole
 reason for the clock-sweep algorithm is precisely to avoid maintaining
 a linked list of least recently used buffers since the head of that
 list is a point of contention.

I don't think anybody is proposing to put the LRU back into a linked list;
given the frequency of usagecount manipulations, that would probably end
pretty badly. What I think Robert, Tom and I are talking about is putting
*some* buffers with usagecount = 0 into a linked list so that when a backend
requires a new page it can take one buffer from the freelist instead of

a) possibly touching quite some pages (I have seen 4 times *every* existing
header) to find one with usagecount = 0
b) having to write the page out itself

If everything is going well, that would mean only the bgwriter (or the
bgfreelist or whatever) performs the clock sweep. Others take *new* pages
from the freelist instead of performing part of the sweep themselves.

Makes more sense?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Atri Sharma
On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:

 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.

This is just off the top of my head, but one possible solution could
be to quantize the levels of hotness. Specifically, we could
categorize buffers based on hotness. All buffers start in level 1 with
usage_count 0. When a buffer has reached a usage_count of 5, and the
next increment of its usage_count would take it above 5, we instead
promote the buffer to the next level and reset its usage_count to 0.
The same logic applies at each level. When we decrement usage_count
and see that it is zero (for some buffer), if it is in a level > 1, we
demote the buffer to the next lower level. If the buffer is in level
1, it is a potential candidate for replacement.

This will allow us to keep a loose idea of the hotness of a page
without storing an unbounded usage_count for a buffer. We can still
update usage_count without locking, as buffers in high contention
which miss an update to their usage_count won't be affected by that
missed update, in accordance with all the discussion upthread.
Thoughts/Comments?

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Merlin Moncure
On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:

 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.

 This is just off the top of my head, but one possible solution could
 be to quantize the levels of hotness. Specifically, we could
 categorize buffers based on hotness. All buffers start in level 1 with
 usage_count 0. When a buffer has reached a usage_count of 5, and the
 next increment of its usage_count would take it above 5, we instead
 promote the buffer to the next level and reset its usage_count to 0.
 The same logic applies at each level. When we decrement usage_count
 and see that it is zero (for some buffer), if it is in a level > 1, we
 demote the buffer to the next lower level. If the buffer is in level
 1, it is a potential candidate for replacement.

 This will allow us to keep a loose idea of the hotness of a page
 without storing an unbounded usage_count for a buffer. We can still
 update usage_count without locking, as buffers in high contention
 which miss an update to their usage_count won't be affected by that
 missed update, in accordance with all the discussion upthread.

how is that different from usage_count itself? usage_count *is* a
measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
already considerable damage that occurs in the situation Andres is
talking about (where everything is marked 'hot' so you have to sweep a
lot more).

also, any added complexity in terms of manipulating usage_count is a
move away from the lockless maintenance I'm proposing.  maybe my idea
is a non-starter on that basis alone, but the mechanic should be kept
as simple as possible.  the idea to move it to the bgwriter is to
pre-emptively do the work that backends are now doing: try and keep
ahead of the allocations being done so that buffer requests are
satisfied quickly.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Atri Sharma


Sent from my iPad

On 02-Apr-2013, at 23:41, Merlin Moncure mmonc...@gmail.com wrote:

 On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:
 
 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of varying levels of hotness can all have usage
 count 5.  This might allow a little more refinement without letting
 the time to degrade the usage count get out of control.
 
 This is just off the top of my head, but one possible solution could
 be to quantize the levels of hotness. Specifically, we could
 categorize buffers based on hotness. All buffers start in level 1 and
 usage_count 0. When a buffer reaches a usage_count of 5, the next
 clock sweep that wants to increment its usage_count (hence taking it
 above 5) instead promotes the buffer to the next level and resets its
 usage_count to 0. The same logic applies at each level. When we
 decrement usage_count and see that it is zero (for some buffer), if
 it is in a level > 1, we demote the buffer to the next lower level. If
 the buffer is in level 1, it is a potential candidate for replacement.
 
 This will allow us to have a loose idea about the hotness of a page,
 without actually storing the usage_count for a buffer. We can still
 update usage_count without locking, as buffers in high contention
 which miss an update in their usage_count won't be affected by that
 missed update, in accordance with all the discussion upthread.
 
 how is that different from usage_count itself? usage_count *is* a
 measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
 already considerable damage that occurs in the situation Andres is
 talking about (where everything is marked 'hot' so you have to sweep a
 lot more).
 
 also, any added complexity in terms of manipulating usage_count is a
 move away from the lockless maintenance I'm proposing.  maybe my idea
 is a non-starter on that basis alone, but the mechanic should be kept
 as simple as possible.  the idea to move it to the bgwriter is to
 pre-emptively do the work that backends are now doing: try and keep
 ahead of the allocations being done so that buffer requests are
 satisfied quickly.
 

I agree, we want to reduce the complexity of usage_count. I was only musing on 
the point Robert raised: whether we want to continue using usage_count and 
refine it for more accurate tracking of the hotness of a buffer.

Regards,

Atri

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
  logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

usage_count manipulation is not a bottleneck but that is irrelevant.
It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
sitting and spinning while holding the BufFreelistLock is a good idea.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Robert Haas
On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

Hmm.  I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
I think a big part of the reason why BufFreelistLock contention is
such a big problem is that we do other operations that involve atomics
while we're holding that lock.  You can have contention on a lock even
if you just take it, do some stuff, and release it.  But the longer
you hold the lock for, the less concurrency you need to have in order
to get a contention problem.  And atomics take a LOT longer to execute
than regular instructions - so it seems to me that usage_count
manipulation is relevant not so much because we get contention on the
buffer spinlocks as because it means we're sitting there serially
taking and releasing locks while sitting on BufFreelistLock.

In fact, BufFreelistLock is really misnamed, because for the most
part, the free list as we implement it is going to be empty.  What the
BufFreelistLock is really doing is serializing the process of scanning
for a free buffer.  I think THAT is the problem.  If we could arrange
things so as to hold BufFreelistLock only for the amount of time
needed to remove a buffer from a freelist ... we'd probably buy
ourselves quite a bit.
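
In outline that would shrink the critical section to something like the
sketch below (uncompiled; field names as in today's freelist code, with the
clock sweep happening entirely outside the lock):

    volatile BufferDesc *buf = NULL;

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        /* pop the head of the freelist; O(1) while holding the lock */
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    LWLockRelease(BufFreelistLock);

    /* if buf is still NULL, fall back to scanning outside the lock */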

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 10:03 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 Hmm.  I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
 I think a big part of the reason why BufFreelistLock contention is
 such a big problem is that we do other operations that involve atomics
 while we're holding that lock.  You can have contention on a lock even
 if you just take it, do some stuff, and release it.  But the longer
 you hold the lock for, the less concurrency you need to have in order
 to get a contention problem.  And atomics take a LOT longer to execute
 than regular instructions - so it seems to me that usage_count
 manipulation is relevant not so much because we get contention on the
 buffer spinlocks as because it means we're sitting there serially
 taking and releasing locks while sitting on BufFreelistLock.

 In fact, BufFreelistLock is really misnamed, because for the most
 part, the free list as we implement it is going to be empty.  What the
 BufFreelistLock is really doing is serializing the process of scanning
 for a free buffer.  I think THAT is the problem.  If we could arrange
 things so as to hold BufFreelistLock only for the amount of time
 needed to remove a buffer from a freelist ... we'd probably buy
 ourselves quite a bit.

right.  I'm imagining a buffer scan loop that looks something like
(uncompiled, untested) this.  TryLockBufHdr does a simple TAS
without spin, returning the lock state (well, true if it acquired the
lock).  usage_count is specifically and deliberately adjusted without
having a lock on the buffer header (this would require some careful
testing and possible changes elsewhere):

  volatile BufferDesc *buf;
  int trycounter = NBuffers;  /* give up after a full pass with no progress */

  for (;;)
  {
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
      StrategyControl->nextVictimBuffer = 0;
      StrategyControl->completePasses++;
    }

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    if (buf->refcount == 0)
    {
      if (buf->usage_count > 0)
      {
        buf->usage_count--;
        trycounter = NBuffers;
      }
      else
      {
        if (TryLockBufHdr(buf))
        {
          /* Found a usable buffer */
          if (strategy != NULL)
            AddBufferToRing(strategy, buf);
          return buf;
        }
      }
    }

    if (--trycounter == 0)
    {
      /*
       * We've scanned all the buffers without making any state changes,
       * so all the buffers are pinned (or were when we looked at them).
       * We could hope that someone will free one eventually, but it's
       * probably better to fail than to risk getting stuck in an
       * infinite loop.
       */
      elog(ERROR, "no unpinned buffers available");
    }
  }
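
For reference, TryLockBufHdr might look like the sketch below, assuming the
buffer header spinlock is the slock_t buf_hdr_lock field and the s_lock.h
convention that TAS() returns zero when the lock is acquired (again,
uncompiled and untested):

    static bool
    TryLockBufHdr(volatile BufferDesc *buf)
    {
        /* unlocked peek first: a busy header is skipped with no bus lock */
        if (buf->buf_hdr_lock)
            return false;

        /* one hardware TAS, no spinning; caller moves on if this fails */
        return TAS((slock_t *) &buf->buf_hdr_lock) == 0;
    }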

The advantages are:
*) no spinning under the free list lock (ever)
*) consequently pretty much zero chance of getting scheduled out
*) far fewer (in some cases drastically fewer) atomic operations unless
the sweep is generally returning a buffer immediately upon every
request, in which case it's about the same
*) fewer CPU cycles overall, unless you somehow miss a whole bunch of
buffers that claim to be locked when in fact they are not (which seems
improbable)

I think you could implement something approximating the above in
conjunction with your buffer nailing, or without (although your stuff
would likely reduce the number of cases where you'd really need it).
Ditto Jeff J's idea to completely replace BufFreelistLock with a
spinlock implementation.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Bruce Momjian
On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
  In fact, BufFreelistLock is really misnamed, because for the most
  part, the free list as we implement it is going to be empty.  What the
  BufFreelistLock is really doing is serializing the process of scanning
  for a free buffer.  I think THAT is the problem.  If we could arrange
  things so as to hold BufFreelistLock only for the amount of time
  needed to remove a buffer from a freelist ... we'd probably buy
  ourselves quite a bit.
 
 right.  I'm imagining a buffer scan loop that looks something like
 (uncompiled, untested) this.  TryLockBufHdr does a simple TAS
 without spin, returning the lock state (well, true if it acquired the
 lock).  usage_count is specifically and deliberately adjusted without
 having a lock on the buffer header (this would require some careful
 testing and possible changes elsewhere):

TAS does a CPU 'lock' instruction, which affects the CPU cache.  Why not
just read the value with no lock?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 3:32 PM, Bruce Momjian br...@momjian.us wrote:
 On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
  In fact, BufFreelistLock is really misnamed, because for the most
  part, the free list as we implement it is going to be empty.  What the
  BufFreelistLock is really doing is serializing the process of scanning
  for a free buffer.  I think THAT is the problem.  If we could arrange
  things so as to hold BufFreelistLock only for the amount of time
  needed to remove a buffer from a freelist ... we'd probably buy
  ourselves quite a bit.

 right.  I'm imagining a buffer scan loop that looks something like
 (uncompiled, untested) this.  TryLockBufHdr does a simple TAS
 without spin, returning the lock state (well, true if it acquired the
 lock).  usage_count is specifically and deliberately adjusted without
 having a lock on the buffer header (this would require some careful
 testing and possible changes elsewhere):

 TAS does a CPU 'lock' instruction which affects the cpu cache.  Why not
 just read the value with no lock?

check again, that's exactly what it does. Note the old implementation
did a LockBufHdr() before examining refcount.  The key logic is here:


if (buf->refcount == 0)
{
  if (buf->usage_count > 0)
  {
    buf->usage_count--;
    trycounter = NBuffers;
  }
  else
  {
    if (TryLockBufHdr(buf))
    {

So we do an unlocked read of refcount and immediately bail if the
buffer is locked according to the CPU cache.  Then we check (still
unlocked) usage_count and decrement it:  Our adjustment may be lost,
but so what?  Finally, we attempt one (and only one) cache line lock
(via TAS_SPIN) of the buffer and again bail if we see any problems
there.   Thus, it's impossible to get stuck in a potentially yielding
spin while holding the free list lock.

I dub this: The Frightened Turtle strategy of buffer allocation.
It's an idea based on my research trying to solve Vlad's issue of
having server stalls during read-only loads (see
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html
for a general backgrounder).  The idea may not actually fix his issue,
or there may be other aggravating aspects such as the
always-capricious linux scheduler, but I'm suspicious.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Andres Freund
On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Friday, March 22, 2013, Ants Aasma wrote:
 
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:
   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.
  
   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).
 
  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.
 
 
  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)
 
 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

In my experience the mere fact of (unlockedly, but still) accessing all the
buffer headers can cause noticeable slowdowns in write only/mostly workloads
with big amounts of shmem.
Due to the write-only nature, large numbers of buffers have similar
usagecounts (since they are infrequently touched after the initial insertion)
and there are no free ones around, so the search for a buffer frequently runs
through *all* buffer headers multiple times until it has decremented all
usagecounts to 0. Then comes a period where free buffers are found easily
(since all usagecounts from the current sweep point onwards are zero). After
that it starts all over.
I have now seen that scenario multiple times :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Friday, March 22, 2013, Ants Aasma wrote:
 
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:
   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.
  
   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).
 
  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.
 
 
  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

Interesting -- I was thinking about that too, but it's a separate
problem with a different trigger.  Maybe a bailout should be in there
so that after X usage_count adjustments the sweeper summarily does an
eviction, or maybe the max declines from 5 once per hundred buffers
inspected or some such.
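
In sketch form that bailout might look like this, inside the sweep loop;
'inspected', 'threshold', and NextVictimBuffer() are hypothetical, not
existing code:

    int inspected = 0;
    int threshold = BM_MAX_USAGE_COUNT;     /* 5 today */

    for (;;)
    {
        volatile BufferDesc *buf = NextVictimBuffer();  /* hypothetical */

        /* every 100 buffers inspected, lower the bar so a sweep through
         * uniformly warm buffers terminates quickly */
        if (++inspected % 100 == 0 && threshold > 1)
            threshold--;

        if (buf->refcount == 0 && buf->usage_count < threshold)
            return buf;                     /* summary eviction candidate */

        if (buf->usage_count > 0)
            buf->usage_count--;
    }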

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 3/23/13 7:41 AM, Ants Aasma wrote:

On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby j...@nasby.net wrote:

Partitioned clock sweep strikes me as a bad idea... you could certainly get
unlucky and end up with a lot of hot stuff in one partition.


Surely that is not worse than having everything in a single partition.
Given a decent partitioning function it's very highly unlikely to have
more than a few of the hottest buffers end up in a single partition.


One could argue that it is worse because you've added another layer of 
unpredictability to performance. If something happens to suddenly put two 
heavily hit sets in the same partition your previously good performance 
suddenly tanks.

Maybe that issue isn't real enough to be worth worrying about, but I still 
think it'd be easier and cleaner to try keeping stuff on the free list first...


Another idea that's been brought up in the past is to have something in the
background keep a minimum number of buffers on the free list. That's how the OS
VM systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I
don't remember if anyone actually tested the idea.


Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.


I think at some point we need to stop depending on just bgwriter for all this 
stuff. I believe it would be much cleaner if we had separate procs for 
everything we needed (although some synergies might exist; if we wanted to set 
hint bits during write then bgwriter *is* the logical place to put that).

In this case, I don't think keeping stuff on the free list is close enough to 
checkpoints that we'd want bgwriter to handle both. At most we might want them 
to pass some metrics back and forth.
--
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 4/1/13 4:55 PM, Merlin Moncure wrote:

 On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:

 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:

  On Friday, March 22, 2013, Ants Aasma wrote:

  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:

   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.

   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.

  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

Interesting -- I was thinking about that too, but it's a separate
problem with a different trigger.  Maybe a bailout should be in there
so that after X usage_count adjustments the sweeper summarily does an
eviction, or maybe the max declines from 5 once per hundred buffers
inspected or some such.


What's the potential downside on that though? IE: what happens if this scheme 
suddenly evicts the root page on a heavily used index? You'll suddenly have a 
ton of stuff blocked waiting for that page to come back in.

This is a use case that I think would benefit greatly from a background process 
that keeps pages in the free list.

That said, I now suspect that your frightened turtle approach would be of higher value 
than bgfreelist... but I suspect we'll ultimately want both of them for different 
reasons.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 3/23/13 4:43 AM, Amit Kapila wrote:

I have tried one of the ideas: adding the buffers the background writer finds
reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
This can reduce the clock sweep as it can find buffers from the freelist.


That's a nice potential efficiency gain, but it's not the same as having a separate bg 
process charged with keeping pages on the freelist. I believe a separate process would be 
useful in a wider variety of workloads, because it's not dependent on stumbling across 0 
count blocks; it would actively work to produce zero count blocks when none 
existed and then free-list them.
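
In rough outline, such a process's main loop might look like the sketch
below; FreeListLength(), SweepOneBuffer(), and freelist_target are all
hypothetical, the point is only the division of labor:

    for (;;)
    {
        /* top up the freelist before backends must sweep for themselves */
        while (FreeListLength() < freelist_target)
        {
            volatile BufferDesc *buf = SweepOneBuffer(); /* advance the clock hand */

            if (buf != NULL)
                StrategyFreeBuffer(buf);    /* push onto the shared freelist */
        }
        pg_usleep(10000);                   /* nap 10ms, then re-check */
    }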


It shows performance improvement for read loads when data can be contained
in shared buffers,
but when the data becomes large and I/O is involved, it shows some dip as
well.


Do you remember off-hand why it slowed down with I/O so I don't have to read 
the whole thread? :) Was it just a matter of it evicting dirty pages sooner 
than it would otherwise?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Atri Sharma
  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

I completely agree. The idea of spinning for a lock while already
holding another lock seems like a recipe for a performance hit.

Regards,

Atri




--
Regards,

Atri
l'apprenant

On Mon, Apr 1, 2013 at 6:58 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
  inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 merlin



-- 
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 6:09 PM, Jim Nasby j...@nasby.net wrote:
 On 4/1/13 4:55 PM, Merlin Moncure wrote:

 On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:

 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:

  On Friday, March 22, 2013, Ants Aasma wrote:

  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:

   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.

   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.

  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

 Interesting -- I was thinking about that too, but it's a separate
 problem with a different trigger.  Maybe a bailout should be in there
 so that after X usage_count adjustments the sweeper summarily does an
 eviction, or maybe the max declines from 5 once per hundred buffers
 inspected or some such.

 What's the potential downside on that though? IE: what happens if this
 scheme suddenly evicts the root page on a heavily used index? You'll
 suddenly have a ton of stuff blocked waiting for that page to come back in.

That seems pretty unlikely because of (A) sheer luck of hitting that
page for the dropout (if your buffer count is N the chances of losing
it would seem to be 1/N at most) and (B) highly used pages are much more
likely to be pinned and thus immune from eviction.  But my issue with
this whole line of analysis is that I've never been able to turn it
up in simulated testing.   Probably to do it you'd need very very fast
storage.

 This is a use case that I think would benefit greatly from a background
 process that keeps pages in the free list.

 That said, I now suspect that your frightened turtle approach would be of
 higher value than bgfreelist... but I suspect we'll ultimately want both
 of them for different reasons.

Well maybe.  Performance analysis of patches like this has to be
systematic and extremely thorough, so who knows what's the best way
forward? (hint: not me).  I'm hoping that adjusting clock sweep
behavior will shave a few cycles in uncontended workloads -- this will
make it a lot easier to sell performance wise.  But I'm optimistic and
maybe we can greatly reduce allocation contention without a lot of
work.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Amit Kapila


 -Original Message-
 From: Jim Nasby [mailto:j...@nasby.net]
 Sent: Tuesday, April 02, 2013 4:43 AM
 To: Amit Kapila
 Cc: 'Ants Aasma'; 'Merlin Moncure'; 'Tom Lane'; 'Atri Sharma'; 'Greg
 Stark'; 'PostgreSQL-development'
 Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
 
 On 3/23/13 4:43 AM, Amit Kapila wrote:
  I have tried one of the ideas: adding the buffers the background writer
  finds reusable to the freelist.
  http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
  This can reduce the clock sweep as it can find buffers from the freelist.
 
 That's a nice potential efficiency gain, but it's not the same as
 having a separate bg process charged with keeping pages on the
 freelist. I believe a separate process would be useful in a wider
 variety of workloads, because it's not dependent on stumbling across 0
 count blocks; it would actively work to produce zero count blocks
 when none existed and then free-list them.

If the bg process works to produce zero-count blocks, wouldn't it give
priority to reducing usage counts even when there is no demand? Or would you
like the work in the bg process to be driven by some statistics on buffer
usage?

  It shows performance improvement for read loads when data can be
 contained
  in shared buffers,
  but when the data becomes large and I/O is involved, it shows some
 dip as
  well.
 
 Do you remember off-hand why it slowed down with I/O so I don't have to
 read the whole thread? :) 

You can go through the mail below to read some of my observations regarding
the performance improvement:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4C@szxeml509-mbx

For the case when I/O is involved, you can check the readings in the mail
below:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4C@szxeml509-mbx


 Was it just a matter of it evicting dirty pages sooner than it would
otherwise?

The exact reason was not nailed down, but as it is an I/O-heavy test, there is
some chance that OS scheduling also played a role.
The reason I think the OS can be involved is that when we are flushing
buffers internally, the OS might also change its own usage counts based on
accesses to the buffers.
The other possibility is, as pointed out by you, early eviction.


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-31 Thread Jeff Janes
On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


That is one of multiple issues.  Contention on the BufFreelistLock is
another one.  I agree that usage_count maintenance is unlikely to become a
bottleneck unless one or both of those is fixed first (and maybe not even
then)

...



 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers.


I do not think that this is the case.  Neither of the SELECT-only
contention points (pinning/unpinning of index root blocks when all data is
in shared_buffers, and BufFreelistLock when all data is not in
shared_buffers) is made worse by increasing shared_buffers, as far as I have
seen.  They do scale badly with the number of concurrent processes, though.

The reports of write-heavy workloads not scaling well with shared_buffers
do not seem to be driven by the buffer management algorithm, or at least
not the freelist part of it.  They mostly seem to center on the kernel and
the IO controllers.

 Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Bruce Momjian
On Fri, Mar 22, 2013 at 06:06:18PM +, Greg Stark wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  And we definitely looked at ARC
 
 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).

The problem was cache line overhead between CPUs to manage the ARC
queues.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Bruce Momjian
On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
  I think there is some very low hanging optimization fruit in the clock
  sweep loop.   first and foremost, I see no good reason why when
  scanning pages we have to spin and wait on a buffer in order to
  pedantically adjust usage_count.  some simple refactoring there could
  set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could construed as a 'trylock' variant of a spinlock
  and might help out with cases where an especially hot buffer is
  locking up the sweep.  This is exploiting the fact that from
  StrategyGetBuffer we don't need a *particular* buffer, just *a*
  buffer.
 
 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.
 
 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

Could we view the usage count, and if it is 5, and if there is someone
holding the lock, we just ignore the buffer and move on to the next
buffer?  Seems that could be done with no locking.
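
For what it's worth, that kind of conditional acquire is essentially
test-and-test-and-set; a generic sketch using a GCC builtin (not our s_lock
machinery) would be:

    #include <stdbool.h>

    static inline bool
    ttas_trylock(volatile int *lock)
    {
        /* plain read first: a busy lock is skipped without generating
         * cache line contention */
        if (*lock != 0)
            return false;

        /* only an apparently free lock pays for the real atomic exchange */
        return __sync_lock_test_and_set(lock, 1) == 0;
    }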

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Merlin Moncure
On Tue, Mar 26, 2013 at 11:40 AM, Bruce Momjian br...@momjian.us wrote:
 On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
  I think there is some very low hanging optimization fruit in the clock
  sweep loop.   first and foremost, I see no good reason why when
  scanning pages we have to spin and wait on a buffer in order to
  pedantically adjust usage_count.  some simple refactoring there could
  set it up so that a simple TAS (or even a TTAS with the first test in
   front of the cache line lock as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could construed as a 'trylock' variant of a spinlock
  and might help out with cases where an especially hot buffer is
  locking up the sweep.  This is exploiting the fact that from
  StrategyGetBuffer we don't need a *particular* buffer, just *a*
  buffer.

 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.

 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

 Could we view the usage count, and if it is 5, and if there is someone
 holding the lock, we just ignore the buffer and move on to the next
 buffer?  Seems that could be done with no locking.

The idea is that if someone is "holding" the lock, to completely ignore
the buffer regardless of usage.  Quotes there because we test the lock
without taking the cache line lock.  Now if the buffer is apparently
unlocked but turns out to be locked once you *do* acquire the cache line
lock in anticipation of refcounting, again immediately bail and go to the
next buffer.

I see no reason whatsoever to have the buffer allocator spin and wait on a
blocked buffer.  This is like jumping onto a merry-go-round being spun
by sadistic teenagers.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
 Perhaps this isn't the help you were looking for, but I spent a long time
 looking into this a few years ago.  Then I stopped and decided to work on
 other things.  I would recommend you do so too.

Agreed. It seems that my concerns were not valid, and since you have
already done some testing here, it further closes the matter.


 4) If most, but not quite all, of the highly-used data fits shared_buffers
 and shared_buffers takes most of RAM (or at least, most of RAM not already
 needed for other things like work_mem and executables), then the replacement
 policy matters a lot.  But different policies suit different work-loads, and
 there is little reason to think we can come up with a way to choose between
 them.  (Also, in these conditions, performance is very chaotic.  You can run
 the same algorithm for a long time, and it can suddenly switch from good to
 bad or the other way around, and then stay in that new mode for a long
 time).  Also, even if you come up with a good algorithm, if you make the
 data set 20% smaller or 20% larger, it is no longer a good algorithm.

Does that mean that an ideal, high performance postgres setup should
*never* set shared_buffers to a large percentage of the system's
RAM? If so, have we ever encountered issues on low-spec systems?


 5) Having buffers enter with usage_count=0 rather than 1 would probably be
 slightly better most of the time under conditions described in 4, but there
 is no way get enough evidence of this over enough conditions to justify
 making a change.  And besides, how often do people run with shared_buffers
 being most of RAM, and the hot data just barely not fitting in it?

Agreed.

 1) If all data fits in RAM but not shared_buffers, and you have a very large
 number of CPUs and a read-only or read-mostly workload, then BufFreelistLock
 can be a major bottleneck.  (But, on an Amazon high-CPU instance, I did not
 see this very much.  I suspect the degree of problem depends a lot on
 whether you have a lot of sockets with a few CPUs each, versus one chip with
 many CPUs).  This is very easy to come up with model cases for, pgbench -S
 -c8 -j8, for example, can often show it.

I will try that, thanks.

 2) A major reason that people run with shared_buffers much lower than RAM is
 that performance seems to suffer with shared_buffers > 8GB under write-heavy
 workloads, even with spread-out checkpoints.  This is frequently reported as
 a real world problem, but as far as I know has never been reduced to a
 simple reproducible test case. (Although there was a recent thread, maybe
 "High CPU usage / load average after upgrading to Ubuntu 12.04", that I
 thought might be relevant to this.  I haven't had the time to seriously
 study the thread, or the hardware to investigate it myself)


This seems interesting. Do we have any indications as to what the
problems could be?

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma

 I'll have to take a look.  Removing *all spinning* from page
 allocation though feels like it might be worthwhile to test (got to
 give some bonus points for being a very local change and simple to
 implement).  I wonder if with more shared buffers you tend to sweep
 more buffers per allocation.  (IIRC Jeff J was skeptical of that).

Replacing the spinning with a TAS on the buffer header sounds like a good idea.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Ants Aasma
On Sun, Mar 24, 2013 at 9:29 AM, Atri Sharma atri.j...@gmail.com wrote:
 I'll have to take a look.  Removing *all spinning* from page
 allocation though feels like it might be worthwhile to test (got to
 give some bonus points for being a very local change and simple to
 implement).  I wonder if with more shared buffers you tend to sweep
 more buffers per allocation.  (IIRC Jeff J was skeptical of that).

 Replacing the spinning with a TAS on the buffer header sounds like a good idea.

Well, TAS is exactly what spinlocks are spinning on. Plain old
unlocked read-modify-write should be good enough for the clock sweep usage
count update, with the header lock taken only when we decide to try and
evict something. The data raciness will mean a higher or lower than
normal usage count when an increment and a decrement race each
other. The race is only likely for buffers with high contention. If we
use the value calculated locally to decide on eviction, the highly
used buffers where this is likely will get at least one clock sweep
cycle of grace time. If they are indeed highly used it's likely that
someone will manage to bump the usage_count in the meanwhile. If they
are not hot, a rare speedup or delay in eviction won't matter much.
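
As a concrete sketch, the clock sweep side would then be just this
(TryEvict is a hypothetical stand-in for taking the header spinlock and
re-checking for real):

    int count = buf->usage_count;       /* plain unlocked read */

    if (count > 0)
        buf->usage_count = count - 1;   /* store derived from the local copy,
                                         * so the value can never go negative */
    else
        TryEvict(buf);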

Getting rid of memory barriers associated with locking in the clock
sweep will pipeline the loads and stores and so should bring on a good
performance increase for scanning large swathes of buffer headers.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Greg Smith

On 3/22/13 8:45 AM, Ants Aasma wrote:

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general.


Right.  The current page replacement method works as expected.  Many 
frequently accessed pages accumulate a usage count of 5 before the clock 
sweep hits them.  Pages that are accessed once and not again before the 
clock sweep are evicted.  There are several theoretically better ways to 
approach this.  Anyone who hasn't already been working on this for a few 
years is very unlikely to come up with a brand new idea, one that hasn't 
already been tested in the academic research.


But the real blocker here isn't ideas, it's creating benchmark workloads 
to validate any change.  Right now I see the most promising work that 
could lead toward the performance farm idea as all of the Jenkins 
based testing that's been going on recently.  Craig Ringer has been using 
that for 2ndQuadrant work here, Peter Eisentraut has been working with 
it: 
http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html 
and the PostGIS project uses it too.  There's some good momentum brewing 
there.


When we have regular performance testing with a mix of workloads--I have 
about 10 in mind to start--at that point we could start testing 
performance changes to the buffer replacement.  Until then this whole 
area is hard to touch usefully.  You have to assume that any tuning you 
do for one type of workload might accidentally slow another.  Starting 
with a lot of baseline workloads is the only way to move usefully 
forward when facing that problem.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
On Sun, Mar 24, 2013 at 6:11 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 3/22/13 8:45 AM, Ants Aasma wrote:

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general.


 Right.  The current page replacement method works as expected.  Many
 frequently accessed pages accumulate a usage count of 5 before the clock
 sweep hits them.  Pages that are accessed once and not again before the
 clock sweep are evicted.  There are several theoretically better ways to
 approach this.  Anyone who hasn't already been working on this for a few
 years is very unlikely to come up with a brand new idea, one that hasn't
 already been tested in the academic research.

 But the real blocker here isn't ideas, it's creating benchmark workloads to
 validate any change.  Right now I see the most promising work that could
 lead toward the performance farm idea as all of the Jenkins based testing
 that's been going on recently.  Craig Ringer has using that for 2ndQuadrant
 work here, Peter Eisentraut has been working with it:
 http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html and
 the PostGIS project uses it too.  There's some good momentum brewing there.

 When we have regular performance testing with a mix of workloads--I have
 about 10 in mind to start--at that point we could start testing
 performance changes to the buffer replacement.  Until then this whole area
 is hard to touch usefully.  You have to assume that any tuning you do for
 one type of workload might accidentally slow another.  Starting with a lot
 of baseline workloads is the only way to move usefully forward when facing
 that problem.

Wow! A Jenkins based performance farm sounds like just what we want.

Can we add tests for lock contention (as discussed above), so that we
can move to some actual diagnostics?

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
 If we
 use the value calculated locally to decide on eviction, the highly
 used buffers where this is likely will get at least one clock sweep
 cycle of grace time. If they are indeed highly used it's likely that
 someone will manage to bump the usage_count in the meanwhile. If they
 are not hot, a rare speedup or delay in eviction won't matter much.

Yeah, a buffer page getting an extra clock sweep of life in exchange for a
potential performance improvement isn't much of a cost to pay.

So, essentially, we decide locally which page to evict, then try to
get a lock on the header only when we want to evict the victim page?
Sounds like the contention for the header should go down considerably.

Unlocked incrementing/decrementing of USAGE_COUNT may leave the
values of USAGE_COUNT off from their true values by 1 or 2 counts on
highly contended buffer pages, which shouldn't matter, precisely
because they are highly contended. I agree, it is likely that some
process will increase the usage_count anyway.

 Getting rid of memory barriers associated with locking in the clock
 sweep will pipeline the loads and stores and so should bring on a good
 performance increase for scanning large swathes of buffer headers.

This does sound interesting. If the Jenkins based performance farm
gets ready, we can do some tests and see the kind of results we get.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Ants Aasma
On Sun, Mar 24, 2013 at 7:03 PM, Atri Sharma atri.j...@gmail.com wrote:
 So, essentially, we decide locally which page to evict, then try to
 get a lock on the header only when we want to evict the victim page?
 Sounds like the contention for the header should go down considerably.

Not exactly locally, the idea is still to use the shared buffer header
structures. We just accept any errors in usage_count coming from data
races. This won't solve buffer header contention as modifying
usage_count still brings the cacheline in an exclusive state. It does
help with reducing the time BufFreelistLock is held and it
eliminates spinning on the clocksweep side so only queries that access
the contended buffer are hurt by spinning.

 Unlocked incrementing/decrementing of USAGE_COUNT may lead to the
 values of USAGE_COUNT differing from the actual value they should be
 having by 1 or 2 counts in case of high contention buffer pages, which
 shouldn't matter, as they are in high contention. I agree, it is
 likely that a process(s) will increase the usage_count anyways.

There is no point in an unlocked increment as this is done in
conjunction with a buffer pin that requires the header spinlock
anyway.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jim Nasby

On 3/22/13 7:27 PM, Ants Aasma wrote:

On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).


Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be in my opinion convert to a futex based spinlock so when
there is contention it doesn't completely kill performance and then
try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf


Partitioned clock sweep strikes me as a bad idea... you could certainly get 
unlucky and end up with a lot of hot stuff in one partition.

Another idea that's been brought up in the past is to have something in the 
background keep a minimum number of buffers on the free list. That's how the 
OS VM systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I don't 
remember if anyone actually tested the idea.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Amit Kapila
On Saturday, March 23, 2013 9:34 AM Jim Nasby wrote:
 On 3/22/13 7:27 PM, Ants Aasma wrote:
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
 
  One other interesting idea I have seen is closeable scalable nonzero
  indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
  use a tree structure to dynamically stripe access to the shared lock
  counter when contention is detected. Downside is that considerable
  amount of shared memory is needed so there needs to be some way to
  limit the resource usage. This is actually somewhat isomorphic to the
  nailing idea.
 
  The issue with the current buffer management algorithm is that it
  seems to scale badly with increasing shared_buffers. I think the
  improvements should concentrate on finding out what is the problem
  there and figuring out how to fix it. A simple idea to test would be
  to just partition shared buffers along with the whole clock sweep
  machinery into smaller ones, like the buffer mapping hash tables
  already are. This should at the very least reduce contention for the
  clock sweep even if it doesn't reduce work done per page miss.
 
  [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
 
 Partitioned clock sweep strikes me as a bad idea... you could certainly
 get unlucky and end up with a lot of hot stuff in one partition.
 
 Another idea that's been brought up in the past is to have something in
 the background keep a minimum number of buffers on the free list.
 That's how OS VM systems I'm familiar with work, so there's precedent
 for it.
 
 I recall there were at least some theoretical concerns about this, but
 I don't remember if anyone actually tested the idea.

I have tried one of the ideas: adding the buffers the background writer finds
reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
This can reduce the clock sweep work, as backends can find buffers on the freelist.

It shows a performance improvement for read loads when the data can be contained
in shared buffers,
but when the data becomes large and I/O is involved, it shows some dip as
well.
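
The gist of it, roughly (a simplified sketch of the idea, not the actual
patch, which differs in the details):

    /* in the bgwriter's scan: push clean, unpinned, usage_count-zero
     * buffers onto the freelist instead of leaving them for a backend's
     * clock sweep to find */
    bool        push = false;

    LockBufHdr(buf);
    if (buf->refcount == 0 && buf->usage_count == 0 &&
        !(buf->flags & BM_DIRTY))
        push = true;
    UnlockBufHdr(buf);

    if (push)
        StrategyFreeBuffer(buf);    /* existing freelist.c entry point */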


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:29 AM, Atri Sharma atri.j...@gmail.com wrote:
 One way to distribute memory contention in case of spinlocks could be
 to utilize the fundamentals of NUMA architecture. Specifically, we can
 let the contending backends spin on local flags instead on the buffer
 header flags directly. As access to local cache lines is much cheaper
 and faster than memory locations which are far away in NUMA, we could
 potentially reduce the memory overhead for a specific line and reduce
 the overall overheads as well.

This is not even specific to NUMA architectures (which by now means all
multiprocessor machines); even multicore machines have overheads for
bouncing cache lines. The locks don't even have to be local, it's good
enough to just have better probability of each backend contending
hitting a different lock, if we take care of not having the locks
share cache lines. IMO that's the whole point of striping locks, the
critical section is usually cheaper than the cost of getting the cache
line in an exclusive state.
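
To illustrate the cacheline point: striped locks only help if the stripes
are padded onto separate lines, something like this (a sketch; the 64-byte
line size is an assumption and varies by platform):

    #define CACHELINE_SIZE 64       /* assumed; platform dependent */

    typedef struct
    {
        slock_t     lock;
        char        pad[CACHELINE_SIZE - sizeof(slock_t)];
    } PaddedSpinLock;

    /* without the pad, several stripes share one cache line and every
     * TAS still bounces that same line between sockets */
    static PaddedSpinLock stripes[64];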

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby j...@nasby.net wrote:
 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

Surely that is not worse than having everything in a single partition.
Given a decent partitioning function it's very highly unlikely to have
more than a few of the hottest buffers end up in a single partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Atri Sharma


 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

 One way to handle this could be to have dynamic membership of pages
in the partitions. Based on activity for a page, it could be moved to
another partition. In this manner, we *could* distribute the hot and
not so hot buffer pages and hence it could help.

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Thu, Mar 21, 2013 at 9:51 PM, Atri Sharma atri.j...@gmail.com wrote:

 Hello all,

 Sorry if this is a naive question.

 I was going through Greg Smith's slides on buffer
 cache(
 http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.

 Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.


It is replaced when the usage_count is already found to be zero, not when
it is made zero.
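
To spell that out, the loop in StrategyGetBuffer is roughly this (a
simplified sketch; the freelist check, the bgwriter latch handling, and
the "no unpinned buffers available" exit are elided, and the real code
returns with the buffer header spinlock still held):

    for (;;)
    {
        volatile BufferDesc *buf;

        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;     /* decrement, keep scanning */
            else
                return buf;             /* found at zero: our victim */
        }
        UnlockBufHdr(buf);
    }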


 I feel that this could lead to a bias towards replacement of
 relatively younger pages in the  cache over older pages. An entry
 which has just entered the cache with USAGE_COUNT=1 could be replaced
 soon, but it may be needed frequently in the near future,


Well, it may be needed.  But then again, it may not be needed.  And that
old page, that also may be needed frequently in the future (which is far
more likely than a new page--after all an old page likely got old for a
reason).  The best evidence that it will be needed again is that it
actually has been needed again.

I'm more convinced in the other direction, new pages should enter with 0
rather than with 1.  I think that the argument that a new buffer needs to
be given more of an opportunity to get used again is mostly bogus.  You
cannot bully the shared_buffers into being larger than it is.  If all the
incoming buffers get more opportunity, that just means the buffer-clock
ticks twice as fast, and really none of them has more opportunity when you
measure that opportunity against an outside standard (wall time, or
work-load accomplished).  All of our children cannot be above average.

Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma a...@cybertec.at wrote:
 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
 logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds)

yeah. There's something to be said for having an upper bound on the
length of time to get a page out (except in the special case when most
of them are pinned).  Right now, any page contention on a buffer
header for any reason can shut down buffer allocation, and that's just
not good.  It's obviously not very likely to happen but I think it can
and does happen.  The more I think about it, the more I think it's a bad
idea to spin during buffer allocation for any reason, ever.

 However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

Yup -- futexes are another way to go.  They are Linux-specific though.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

Yeah this is a more general (albeit more complicated) solution and
would likely be fantastic.  Is it safe to assume that refcounting is
the only likely cause of contention?

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.

 [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look.  Removing *all spinning* from page
allocation though feels like it might be worthwhile to test (got to
give some bonus points for being a very local change and simple to
implement).  I wonder if with more shared buffers you tend to sweep
more buffers per allocation.  (IIRC Jeff J was skeptical of that).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I'm more convinced in the other direction, new pages should enter with 0
 rather than with 1.  I think that the argument that a new buffer needs to
 be given more of an opportunity to get used again is mostly bogus.

IIRC, the argument for starting at 1 not 0 is that otherwise a new page
might have an infinitesimally small lifespan, if the clock sweep should
reach it just after it gets entered into the buffers.  By starting at
1, the uncertainty in a new page's lifespan runs from 1 to 2 sweep times
not 0 to 1 sweep time.

I think though that this argument only holds water if the buffer didn't
get found via the clock sweep to start with --- otherwise, it ought to
have just about one clock sweep of time before the sweep comes back to
it.  It does apply to buffers coming off the freelist, though.

Thus, if we were to get rid of the freelist then maybe we could change
the starting usage_count ... but whether that's a good idea in itself
is pretty uncertain.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma atri.j...@gmail.com wrote:




 Not yet, I figured this might be a problem and am designing test cases
 for the same. I would be glad for some help there please.


Perhaps this isn't the help you were looking for, but I spent a long time
looking into this a few years ago.  Then I stopped and decided to work on
other things.  I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows
that there is a problem, then why should I believe that there actually is a
problem?  If you take a well-known problem (like, say, bad performance at
shared_buffers > 8GB (or even lower, on Windows)) and create an artificial
test case to exercise and investigate that, that is one thing.  But why
invent pathological test cases with no known correspondence to reality?
 There are plenty of real problems to work on, and some of them are just as
intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy
doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I
believe), then what mostly matters is your OS's replacement policy, not
pgsql's.  Not much a pgsql hacker can do about this, other than turn into a
kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd
replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits shared_buffers
and shared_buffers takes most of RAM (or at least, most of RAM not already
needed for other things like work_mem and executables), then the
replacement policy matters a lot.  But different policies suit different
work-loads, and there is little reason to think we can come up with a way
to choose between them.  (Also, in these conditions, performance is very
chaotic.  You can run the same algorithm for a long time, and it can
suddenly switch from good to bad or the other way around, and then stay in
that new mode for a long time).  Also, even if you come up with a good
algorithm, if you make the data set 20% smaller or 20% larger, it is no
longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 would probably be
slightly better most of the time under conditions described in 4, but there
is no way get enough evidence of this over enough conditions to justify
making a change.  And besides, how often do people run with shared_buffers
being most of RAM, and the hot data just barely not fitting in it?


If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very
large number of CPUs and a read-only or read-mostly workload,
then BufFreelistLock can be a major bottleneck.  (But, on an Amazon
high-CPU instance, I did not see this very much.  I suspect the degree of
problem depends a lot on whether you have a lot of sockets with a few CPUs
each, versus one chip with many CPUs).  This is very easy to come up with
model cases for, pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM
is that performance seems to suffer with shared_buffers > 8GB under
write-heavy workloads, even with spread-out checkpoints.  This is
frequently reported as a real world problem, but as far as I know has never
been reduced to a simple reproducible test case. (Although there was a
recent thread, maybe High CPU usage / load average after upgrading to
Ubuntu 12.04, that I thought might be relevant to this.  I haven't had the
time to seriously study the thread, or the hardware to investigate it
myself)

Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma


Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:

 On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
 Hello all,
 
 Sorry if this is a naive question.
 
 I was going through Greg Smith's slides on buffer
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
 he.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.
 
  Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.
 
 Yes, it is replaced, but in the next clock sweep pass, not immediately after
 being made 0.
 So until the next pass, if nobody accesses the buffer and all other
 buffers have a higher count, it can be replaced.
 Also, for the buffer it has returned, whose usage count becomes 1, the
 sweep will come around to reduce that usage count only in the next pass.
 So in all, I think it needs 2 passes for a freshly returned buffer to be
 re-used in case no one uses it again.
 
 With Regards,
 Amit Kapila.
 

Hmm, so in the second pass, it gets replaced, right?

I think that if the initialization of USAGE_COUNT starts at the maximum allowed 
value instead of one, we can have a better solution to this problem.

Another, more complex solution could be to introduce an aging factor as well.

Regards,

Atri

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:
 
 
 Sent from my iPad
 
 On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:
 
  On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
  Hello all,
 
  Sorry if this is a naive question.
 
  I was going through Greg Smith's slides on buffer
 
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
  he.pdf).
  When going through the page replacement algorithm that we use i.e.
  clocksweep algorithm, I felt a potential problem in our current
  system.
 
   Specifically, when a new entry is allocated in the buffer, its
  USAGE_COUNT is set to 1. On each sweep of the algorithm, the
  USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
  zero is replaced.
 
  Yes, it is replaced, but in the next clock sweep pass, not immediately
  after being made 0.
  So until the next pass, if nobody accesses the buffer and all other
  buffers have a higher count, it can be replaced.
  Also, for the buffer it has returned, whose usage count becomes 1, the
  sweep will come around to reduce that usage count only in the next pass.
  So in all, I think it needs 2 passes for a freshly returned buffer to be
  re-used in case no one uses it again.
 
  With Regards,
  Amit Kapila.
 
 
  Hmm, so in the second pass, it gets replaced, right?
  Yes.


 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

So what is your idea? If you start at maximum, what will we do for further
accesses to it?
Why do you want to give more priority to a just-loaded page?


 Another,more complex solution could be to introduce an ageing factor as
 well.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

 So what is your idea, if you start at maximum, what we will do for further
 accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to the maximum value is not at all good. I was
just thinking off the top of my head.

 Why do you want to give more priority to just loaded page?

I just want it to have more chances to stay, rather than being
replaced pretty early. This is because, as I said earlier, a new page
could be in high demand in the near future, and early eviction would lead
to repeated replacement and re-reading of the page, and hence cause overhead.




 Another, more complex solution could be to introduce an aging factor

This is the one I think would work out best: add an age factor reflecting
the time an entry has spent in the cache, along with its
usage count.

So, what I am proposing here is to add another factor in the
clocksweep algorithm when it selects victim pages for replacement.
Specifically, the selection of victim pages should be done with the
usage_count AND the time spent by the entry in the cache. This would
give priority to pages with high accesses and not ignore relatively
young pages as well. If a page is not accessed for a long time after
it was allocated, it would be the ideal victim for replacement both in
terms of USAGE_COUNT as well as age.
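
To make that concrete, a purely hypothetical victim score could look like
this (none of these inputs exist in the buffer headers today; all names
are invented for illustration):

    /* lower score == better eviction victim */
    static int
    victim_score(uint16 usage_count,
                 uint32 sweeps_since_load,
                 uint32 sweeps_since_access)
    {
        if (sweeps_since_access >= 2)
            return 0;                   /* idle for a while: ideal victim */
        if (sweeps_since_load <= 1)
            return usage_count + 1;     /* grace period for a young page */
        return usage_count;             /* otherwise usage_count decides */
    }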

Regards,

Atri




--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:
 
  I think that if the initialization of USAGE_COUNT starts at the
 maximum
  allowed value instead of one, we can have a better solution to this
  problem.
 
  So what is your idea, if you start at maximum, what we will do for
 further
  accesses to it?
 
  I haven't chalked out a detailed plan yet, but I think the idea of
 initializing USAGE_COUNT to maximum value is not at all good. I was
 just thinking off the top of my head.
 
  Why do you want to give more priority to just loaded page?
 
 I just want it to have more chances to stay, rather than being
  replaced pretty early. This is because, as I said earlier, a new page
  could be in high demand in the near future, and early eviction would lead
  to repeated replacement and re-reading of the page, and hence cause overhead.
 
 
 
 
  Another, more complex solution could be to introduce an aging factor
 
 This is the one I think would work out best: add an age factor reflecting
 the time an entry has spent in the cache, along with its
 usage count.
 
 So, what I am proposing here is to add another factor in the
 clocksweep algorithm when it selects victim pages for replacement.
 Specifically, the selection of victim pages should be done with the
 usage_count AND the time spent by the entry in the cache. This would
 give priority to pages with high accesses and not ignore relatively
 young pages as well. If a page is not accessed for a long time after
 it was allocated, it would be the ideal victim for replacement both in
 terms of USAGE_COUNT as well as age.

What would you do if the only young page has usage count zero during the
second sweep?
I don't think introducing another factor along with usage count would help
much.
Have you encountered any workload where relatively young pages are getting
victimized and that is causing performance issues?

With Regards,
Amit Kapila.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 What would you do if the only young page has usage count zero during the
 second sweep?

Umm... The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?


 I don't think introducing another factor along with usage count would help
 much.

What other approach can we take here?


 Have you encountered any workload where relatively young pages are getting
 victimized and that is causing performance issues?

Not yet, I figured this might be a problem and am designing test cases
for the same. I would be glad for some help there please.

Regards,

Atri




--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
  What would you do if the only young page has usage count zero during
 second
  sweep.
 
  Umm... The same approach we take when there is no page with usage
 count zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more used page.
I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
  What would you do if the only young page has usage count zero during
 second
  sweep.

  Umm... The same approach we take when there is no page with usage
 count zero in a sweep in the current algorithm?

 It would give more priority to a young page as compared to a more used page.
 I don't know if that would be the correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance of being replaced initially.
With time, usage_count shall become the deciding factor in victim
selection.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Mar 22, 2013 12:46 PM, Atri Sharma atri.j...@gmail.com wrote:

 This is the one I think would work out best, add an age factor as to
 the time duration which an entry has spent in the cache along with its
 usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate least frequently used instead of
least recently used. MySQL uses it for their buffer replacement
policy. There is also a clock sweep based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to regular clock sweep or even
ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm
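
For what it's worth, the captured records wouldn't need to carry much;
hypothetically (invented shape, no such code exists):

    /* one record per pin/unpin, appended to a per-backend file and
     * replayed later against candidate replacement policies */
    typedef struct
    {
        BufferTag   tag;    /* which page was touched */
        char        op;     /* 'P' = pin, 'U' = unpin */
        uint32      seq;    /* per-backend sequence number */
    } PinTraceRecord;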

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general. My idea was to create
 a patch to capture page pinning traffic from PostgreSQL (maybe stream
 out into a per backend file), run it with some production workloads
 and use that to generate testing workloads for the cache replacement
 policies. Haven't gotten round to actually doing that though.

 [1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm


Thanks for the link. I think LIRS can indeed be helpful in our case.

We should indeed build some test cases for testing this theory. I am
all for capturing page replacement and usage data and analyzing it.

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 You might want to check out the LIRS cache replacement algorithm [1].
 That algorithm tries to estimate least frequently used instead of
 least recently used. Mysql uses it for their buffer replacement
 policy. There is also a clock sweep based approximation called
 CLOCK-Pro. Papers describing and evaluating both are available on the
 net. The evaluations in the papers showed significantly better
 performance for both of those compared to regular clock sweep or even
 ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach.  And we definitely looked at ARC.  It might be
worth checking the archives from back then to see what's already been
considered.

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Greg Stark
On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly?  the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm.  I haven't seen any
reason to think we have a problem with the quality of the replacement
choices.  The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

 Even back when we last hacked on that algorithm, the concerns were not
 so much about which pages it replaced as how much overhead and
 contention was created by the management algorithm.  I haven't seen any
 reason to think we have a problem with the quality of the replacement
 choices.  The proposal to increase the initial usage count would
 definitely lead to more overhead/contention, though, because it would
 result in having to circle around all the buffers more times (on
 average) to get a free buffer.


yup...absolutely.  I have a hunch that the occasional gripes we see
about server stalls under high load with read only (or mostly read
only) loads are coming from spinlock contention under the lwlock
hitting a critical point and effectively shutting the server down
until, by chance, the backend with the lwlock gets lucky and lands the
spinlock.

I think there is some very low hanging optimization fruit in the clock
sweep loop.   first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count.  some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all.   This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep.  This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.
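
In sketch form, that trylock variant might look about like this (using
the existing s_lock.h primitives and the current BufferDesc field names;
illustrative only, not a patch):

    /* inside the sweep loop; buf is the current candidate */
    if (buf->refcount != 0)             /* unlocked peek, possibly stale */
        continue;                       /* looks pinned: skip it */

    if (TAS(&buf->buf_hdr_lock))        /* one test-and-set, never spin */
        continue;                       /* header busy: just move on */

    /* header lock held: inspect/decrement usage_count as today */
    ...
    S_UNLOCK(&buf->buf_hdr_lock);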

I also wonder if we shouldn't (perhaps in addition to the above)
resuscitate Jeff Janes's idea to get rid of the lwlock completely and
manage everything with spinlocks..

Naturally, all of this would have to be confirmed with some very robust testing.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why when
 scanning pages we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock, as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

Hm.  You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented.  So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not).  So the actual benefit is a bit less clear.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why when
 scanning pages we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock, as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.

 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
 this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).
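
A sketch of that unlocked variant (read-test-store, so the count can
never run outside its bounds even when updates race; illustrative only):

    int     uc = buf->usage_count;      /* plain load, no header lock */

    if (uc > 0)
        buf->usage_count = uc - 1;      /* racy store; a lost update is fine */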

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
  logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be in my opinion convert to a futex based spinlock so when
there is contention it doesn't completely kill performance and then
try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.
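
The nailing idea might take a shape roughly like this (all names invented
here for illustration; this is not the actual proposal's code):

    #define MAX_NAILED_BUFFERS 64       /* invented limit */

    /* per-backend, private memory: pins taken on nailed buffers */
    static int32 MyNailedPins[MAX_NAILED_BUFFERS];

    static inline void
    PinNailed(int nail_slot)
    {
        MyNailedPins[nail_slot]++;      /* touches no shared cacheline */
    }

    /* when un-nailing, each backend folds its local count back into
     * the shared buffer header under the usual header spinlock */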

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
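
In sketch form, the partitioning experiment could be as small as this
(invented names; the partition count is arbitrary):

    #define N_SWEEP_PARTS 16            /* cf. NUM_BUFFER_PARTITIONS */

    typedef struct
    {
        slock_t lock;                   /* instead of one global lock */
        int     clock_hand;             /* hand for this partition only */
    } ClockPartition;

    static ClockPartition parts[N_SWEEP_PARTS];

    /* each partition owns a contiguous slice of the buffer pool;
     * a backend picks one (say, by pid) and sweeps only that slice */
    static int
    slice_start(int p)
    {
        return p * (NBuffers / N_SWEEP_PARTS);
    }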

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.


One way to distribute memory contention in case of spinlocks could be
to utilize the fundamentals of NUMA architecture. Specifically, we can
let the contending backends spin on local flags instead of on the buffer
header flags directly. As access to local cache lines is much cheaper
and faster than memory locations which are far away in NUMA, we could
potentially reduce the memory overhead for a specific line and reduce
the overall overheads as well.

Regards.

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-21 Thread Amit Kapila
On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
 Hello all,
 
 Sorry if this is a naive question.
 
 I was going through Greg Smith's slides on buffer
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
 he.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.
 
  Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately after
being made 0.
So until the next pass, if nobody accesses the buffer and all other
buffers have a higher count, it can be replaced.
Also, for the buffer it has returned, whose usage count becomes 1, the
sweep will come around to reduce that usage count only in the next pass.
So in all, I think it needs 2 passes for a freshly returned buffer to be
re-used in case no one uses it again.
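
A worked trace of that, for a buffer nobody touches after allocation:

    /* allocation:    usage_count = 1
     * sweep pass 1:  refcount == 0, usage_count 1 -> 0, buffer skipped
     * sweep pass 2:  refcount == 0, usage_count already 0 -> evicted
     */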

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers