Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-09 Thread Amit Kapila


 -----Original Message-----
 From: Robert Haas [mailto:robertmh...@gmail.com]
 Sent: Tuesday, April 09, 2013 9:28 AM
 To: Amit Kapila
 Cc: Greg Smith; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
 
 On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila amit.kap...@huawei.com
 wrote:
  I still have one more doubt; consider the below scenario for cases when
  we invalidate buffers during moving to freelist v/s just moving to
  freelist:
 
  A backend got the buffer from the freelist for a request of page-9
  (number 9 is random, just to explain); it still has an association with
  another page, page-10. It needs to add the buffer with the new tag (new
  page association) to the bufhash table and remove the buffer with the
  oldTag (old page association).
 
  The benefit of just moving to the freelist is that if we get a request
  for the same page before somebody else uses the buffer for another page,
  it will save a read I/O. However, on the other side, in many cases the
  backend will need an extra partition lock to remove the oldTag (which
  can lead to some bottleneck).
 
  I think saving read I/O is more beneficial, but I am just not sure that
  it is best, as the cases for it might be rare.
 
 I think saving read I/O is a lot more beneficial.  I haven't seen
 evidence of a severe bottleneck updating the buffer mapping tables.  I
 have seen some evidence of spinlock-level contention on read workloads
 that fit in shared buffers, because in that case the system can run
 fast enough for the spinlocks protecting the lwlocks to get pretty
 hot.  But if you're doing writes, or if the workload doesn't fit in
 shared buffers, other bottlenecks slow you down enough that this
 doesn't really seem to become much of an issue.
 
 Also, even if you *can* find some scenario where pushing the buffer
 invalidation into the background is a win, I'm not convinced that
 would justify doing it, because the case where it's a huge loss -
 namely, working set just a tiny bit smaller than shared_buffers - is
 pretty obvious. I don't think we dare fool around with that; the
 townspeople will arrive with pitchforks.
 
 I believe that the big win here is getting the clock sweep out of the
 foreground so that BufFreelistLock doesn't catch fire.  The buffer
 mapping locks are partitioned and, while it's not like that completely
 gets rid of the contention, it sure does help a lot.  So I would view
 that goal as primary, at least for now.  If we get a first round of
 optimization done in this area, that doesn't preclude further
 improving it in the future.

I agree with you that this can be a first step towards improvement.

  Last time, the following tests were executed to validate the results:
 
  Test suite - pgbench
  DB Size - 16 GB
  RAM - 24 GB
  Shared Buffers - 2G, 5G, 7G, 10G
  Concurrency - 8, 16, 32, 64 clients
  Pre-warm the buffers before start of test
 
  Shall we try any other scenarios, or are the above okay for an initial
  test of the patch?
 
 Seems like a reasonable place to start.

I shall work on this for the first CF of 9.4.


With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-08 Thread Robert Haas
On Fri, Apr 5, 2013 at 11:08 PM, Amit Kapila amit.kap...@huawei.com wrote:
 I still have one more doubt; consider the below scenario for cases when we
 invalidate buffers during moving to freelist v/s just moving to freelist:

 A backend got the buffer from the freelist for a request of page-9 (number
 9 is random, just to explain); it still has an association with another
 page, page-10. It needs to add the buffer with the new tag (new page
 association) to the bufhash table and remove the buffer with the oldTag
 (old page association).

 The benefit of just moving to the freelist is that if we get a request for
 the same page before somebody else uses the buffer for another page, it
 will save a read I/O. However, on the other side, in many cases the backend
 will need an extra partition lock to remove the oldTag (which can lead to
 some bottleneck).

 I think saving read I/O is more beneficial, but I am just not sure that it
 is best, as the cases for it might be rare.

I think saving read I/O is a lot more beneficial.  I haven't seen
evidence of a severe bottleneck updating the buffer mapping tables.  I
have seen some evidence of spinlock-level contention on read workloads
that fit in shared buffers, because in that case the system can run
fast enough for the spinlocks protecting the lwlocks to get pretty
hot.  But if you're doing writes, or if the workload doesn't fit in
shared buffers, other bottlenecks slow you down enough that this
doesn't really seem to become much of an issue.

Also, even if you *can* find some scenario where pushing the buffer
invalidation into the background is a win, I'm not convinced that
would justify doing it, because the case where it's a huge loss -
namely, working set just a tiny bit smaller than shared_buffers - is
pretty obvious. I don't think we dare fool around with that; the
townspeople will arrive with pitchforks.

I believe that the big win here is getting the clock sweep out of the
foreground so that BufFreelistLock doesn't catch fire.  The buffer
mapping locks are partitioned and, while it's not like that completely
gets rid of the contention, it sure does help a lot.  So I would view
that goal as primary, at least for now.  If we get a first round of
optimization done in this area, that doesn't preclude further
improving it in the future.

 Last time, the following tests were executed to validate the results:

 Test suite - pgbench
 DB Size - 16 GB
 RAM - 24 GB
 Shared Buffers - 2G, 5G, 7G, 10G
 Concurrency - 8, 16, 32, 64 clients
 Pre-warm the buffers before start of test

 Shall we try any other scenarios, or are the above okay for an initial test
 of the patch?

Seems like a reasonable place to start.

...Robert




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-05 Thread Robert Haas
On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila amit.kap...@huawei.com wrote:
 If we just put it on the freelist, then the next time it gets allocated
 directly from the bufhash table, who will remove it from the freelist?
 Or do you think that, in BufferAlloc, if it gets the buffer from the
 bufhash table, it should check whether it's in the freelist and, if so,
 remove it?

No, I don't think that's necessary.  We already have the following
guard in StrategyGetBuffer:

    if (buf->refcount == 0 && buf->usage_count == 0)
    {
        if (strategy != NULL)
            AddBufferToRing(strategy, buf);
        return buf;
    }

If a buffer is allocated from the freelist and it turns out that it
actually has a non-zero reference count or a non-zero usage count, we
just discard it and pull the next buffer off the freelist instead.
So, in the scenario you describe, the buffer gets reallocated (due to
a non-NULL BufferAccessStrategy, presumably) and then somebody comes
along and pulls it off the freelist.  But, since the buffer has just
been used by someone else, it'll most likely be pinned or have a
non-zero usage count, so we'll just skip it and allocate some other
buffer instead.  No harm done.

Now, it is possible that the buffer could get added to the freelist,
then allocated via a BufferAccessStrategy, and then the clock sweep
could hit it and push the usage count back to 0.  But that's no big
deal either: if we go to put it on the freelist and see (via
buf->freeNext) that it's already there, we can just leave it where it
is (or maybe move it to the end).  On a related note, we probably need
a variant of StrategyFreeBuffer which pushes buffers onto the end of
the freelist rather than the front.  It makes sense to stick
invalidated buffers on the front of the list (which is what
StrategyFreeBuffer does), but non-invalidated buffers should be placed
at the end to more closely approximate LRU.
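
For illustration, a minimal sketch of such a tail-push variant, using the
freelist fields from the freelist.c of that era (firstFreeBuffer,
lastFreeBuffer, freeNext); the function name and exact locking here are
assumptions, not a tested patch:

    /*
     * Sketch of a StrategyFreeBuffer variant that appends to the tail of
     * the freelist instead of pushing onto the head, so non-invalidated
     * buffers approximate LRU order.  Illustrative only.
     */
    static void
    StrategyFreeBufferTail(volatile BufferDesc *buf)
    {
        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

        /* Leave the list alone if the buffer is already somewhere in it. */
        if (buf->freeNext == FREENEXT_NOT_IN_LIST)
        {
            buf->freeNext = FREENEXT_END_OF_LIST;
            if (StrategyControl->firstFreeBuffer < 0)
            {
                /* List was empty: this buffer becomes both head and tail. */
                StrategyControl->firstFreeBuffer = buf->buf_id;
            }
            else
            {
                /* Link the old tail to this buffer. */
                BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext =
                    buf->buf_id;
            }
            StrategyControl->lastFreeBuffer = buf->buf_id;
        }

        LWLockRelease(BufFreelistLock);
    }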

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-05 Thread Amit Kapila
On Saturday, April 06, 2013 12:38 AM Robert Haas wrote:
 On Fri, Apr 5, 2013 at 1:12 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  If we just put it on the freelist, then the next time it gets allocated
  directly from the bufhash table, who will remove it from the freelist?
  Or do you think that, in BufferAlloc, if it gets the buffer from the
  bufhash table, it should check whether it's in the freelist and, if so,
  remove it?
 
 No, I don't think that's necessary.  We already have the following
 guard in StrategyGetBuffer:
 
     if (buf->refcount == 0 && buf->usage_count == 0)
     {
         if (strategy != NULL)
             AddBufferToRing(strategy, buf);
         return buf;
     }
 
 If a buffer is allocated from the freelist and it turns out that it
 actually has a non-zero reference count or a non-zero usage count, we
 just discard it and pull the next buffer off the freelist instead.
 So, in the scenario you describe, the buffer gets reallocated (due to
 a non-NULL BufferAccessStrategy, presumably) and then somebody comes
 along and pulls it off the freelist.  But, since the buffer has just
 been used by someone else, it'll most likely be pinned or have a
 non-zero usage count, so we'll just skip it and allocate some other
 buffer instead.  No harm done.

Yes, you are right; I had missed that part of the code while thinking of this
scenario. But I was talking about a NULL BufferAccessStrategy as well.

I still have one more doubt; consider the below scenario for cases when we
invalidate buffers during moving to freelist v/s just moving to freelist:

A backend got the buffer from the freelist for a request of page-9 (number 9
is random, just to explain); it still has an association with another page,
page-10. It needs to add the buffer with the new tag (new page association)
to the bufhash table and remove the buffer with the oldTag (old page
association).

The benefit of just moving to the freelist is that if we get a request for
the same page before somebody else uses the buffer for another page, it will
save a read I/O. However, on the other side, in many cases the backend will
need an extra partition lock to remove the oldTag (which can lead to some
bottleneck).

I think saving read I/O is more beneficial, but I am just not sure that it is
best, as the cases for it might be rare.
 
 Now, it is possible that the buffer could get added to the freelist,
 then allocated via a BufferAccessStrategy, and then the clock sweep
 could hit it and push the usage count back to 0.  But that's no big
 deal either: if we go to put it on the freelist and see (via
 buf->freeNext) that it's already there, we can just leave it where it
 is (or maybe move it to the end).  On a related note, we probably need
 a variant of StrategyFreeBuffer which pushes buffers onto the end of
 the freelist rather than the front.  It makes sense to stick
 invalidated buffers on the front of the list (which is what
 StrategyFreeBuffer does), but non-invalidated buffers should be placed
 at the end to more closely approximate LRU.

Okay.

Last time, the following tests were executed to validate the results:

Test suite - pgbench
DB Size - 16 GB 
RAM - 24 GB
Shared Buffers - 2G, 5G, 7G, 10G
Concurrency - 8, 16, 32, 64 clients
Pre-warm the buffers before start of test

Shall we try any other scenarios, or are the above okay for an initial test
of the patch?


With Regards,
Amit Kapila.










Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Amit Kapila
On Thursday, April 04, 2013 7:19 AM Greg Smith wrote:
 On 4/2/13 11:54 AM, Robert Haas wrote:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't
 have
  to wait for linear scans of the buffer array.
 
 I was hoping this one would make it to a full six years of being on the
 TODO list before it came up again, missed it by a few weeks.  The
 funniest part is that Amit even submitted a patch on this theme a few
 months ago without much feedback:
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
   That stalled where a few things have, on a) needing more regression
 test workloads, and b) wondering just what the deal with large
 shared_buffers setting degrading performance was.

For b), below are links to results where performance decreased due to large
shared buffers.

http://www.postgresql.org/message-id/attachment/27489/Results.htm
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285442C5@szxeml509-mbx


As per my observation, it occurs when I/O starts. The dip could be due to
fluctuation, or to OS scheduling, or to eviction of dirty pages sooner than
would otherwise happen.

I think further investigation would be more meaningful if the results were
reproduced by someone other than me.

One idea to proceed along this line could be to start with this patch and
then, based on the results, do further experiments to make it more useful.

With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Robert Haas
On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 4/2/13 11:54 AM, Robert Haas wrote:
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.

 I was hoping this one would make it to a full six years of being on the TODO
 list before it came up again, missed it by a few weeks.  The funniest part
 is that Amit even submitted a patch on this theme a few months ago without
 much feedback:
 http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
 That stalled where a few things have, on a) needing more regression test
 workloads, and b) wondering just what the deal with large shared_buffers
 setting degrading performance was.

Those are impressive results.  I think we should seriously consider
doing something like that for 9.4.  TBH, although more workloads to
test is always better, I don't think this problem is so difficult that
we can't have some confidence in a theoretical analysis.  If I read
the original thread correctly (and I haven't looked at the patch
itself), the proposed patch would actually invalidate buffers before
putting them on the freelist.  That effectively amounts to reducing
shared_buffers, so workloads that are just on the edge of what can fit
in shared_buffers will be harmed, and those that benefit incrementally
from increased shared_buffers will be as well.

What I think we should do instead is collect the buffers that we think
are evictable and stuff them onto the freelist without invalidating
them.  When a backend allocates from the freelist, it can double-check
that the buffer still has usage_count 0.  The odds should be pretty
good.  But even if we sometimes notice that the buffer has been
touched again after being put on the freelist, we haven't expended all
that much extra effort, and that effort happened mostly in the
background.  Consider a scenario where only 10% of the buffers have
usage count 0 (which is not unrealistic).  We scan 5000 buffers and
put 500 on the freelist.  Now suppose that, due to some accident of
the workload, 75% of those buffers get touched again before they're
allocated off the freelist (which I believe to be a pessimistic
estimate for most workloads).  Now, that means that only 125 of those
500 buffers will succeed in satisfying an allocation request.  That's
still a huge win, because it means that each backend only has to examine
an average of 4 buffers before it finds one to allocate.  If it had
needed to do the freelist scan itself, it would have had to touch 40
buffers before finding one to allocate.
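
As a rough sketch of the allocation side of this scheme (names follow the
freelist.c of that era, but the function itself and the elided locking are
assumptions for illustration):

    /*
     * Illustrative sketch: pop buffers from a bgwriter-populated freelist,
     * rechecking each one since it was pushed without being invalidated.
     * BufFreelistLock / buffer-header locking is elided for clarity.
     */
    static volatile BufferDesc *
    GetBufferFromFreelist(BufferAccessStrategy strategy)
    {
        while (StrategyControl->firstFreeBuffer >= 0)
        {
            volatile BufferDesc *buf =
                &BufferDescriptors[StrategyControl->firstFreeBuffer];

            /* Unlink from the freelist. */
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            /*
             * The buffer may have been touched again since it was listed;
             * if so, skip it, just as StrategyGetBuffer does today.
             */
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                if (strategy != NULL)
                    AddBufferToRing(strategy, buf);
                return buf;
            }
        }
        return NULL;    /* list ran dry: caller falls back to the sweep */
    }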

In real life, I think the gains are apt to be, if anything, larger.
IME, it's common for most or all of the buffer pool to be pinned at
usage count 5.  So you could easily have a situation where the arena
scan has to visit millions of buffers to find one to allocate.  If
that's happening in the background instead of the foreground, it's a
huge win.  Also, note that there's nothing to prevent the arena scan
from happening in parallel with allocations off of the freelist - so
while foreground processes are emptying the freelist, the background
process can be looking for more things to add to it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-04 Thread Amit Kapila
On Thursday, April 04, 2013 6:12 PM Robert Haas wrote:
 On Wed, Apr 3, 2013 at 9:49 PM, Greg Smith g...@2ndquadrant.com
 wrote:
  On 4/2/13 11:54 AM, Robert Haas wrote:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't
 have
  to wait for linear scans of the buffer array.
 
  I was hoping this one would make it to a full six years of being on
 the TODO
  list before it came up again, missed it by a few weeks.  The funniest
 part
  is that Amit even submitted a patch on this theme a few months ago
 without
  much feedback:
  http://www.postgresql.org/message-
 id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
  That stalled where a few things have, on a) needing more regression
 test
  workloads, and b) wondering just what the deal with large
 shared_buffers
  setting degrading performance was.
 
 Those are impressive results.  I think we should seriously consider
 doing something like that for 9.4.  TBH, although more workloads to
 test is always better, I don't think this problem is so difficult that
 we can't have some confidence in a theoretical analysis.  If I read
 the original thread correctly (and I haven't looked at the patch
 itself), the proposed patch would actually invalidate buffers before
 putting them on the freelist.  That effectively amounts to reducing
 shared_buffers, so workloads that are just on the edge of what can fit
 in shared_buffers will be harmed, and those that benefit incrementally
 from increased shared_buffers will be as well.
 
 What I think we should do instead is collect the buffers that we think
 are evictable and stuff them onto the freelist without invalidating
 them.  When a backend allocates from the freelist, it can double-check
 that the buffer still has usage_count 0.  The odds should be pretty
 good.  But even if we sometimes notice that the buffer has been
 touched again after being put on the freelist, we haven't expended all
 that much extra effort, and that effort happened mostly in the
 background.  

If we just put it on the freelist, then the next time it gets allocated
directly from the bufhash table, who will remove it from the freelist?
Or do you think that, in BufferAlloc, if it gets the buffer from the bufhash
table, it should check whether it's in the freelist and, if so, remove it?

With Regards,
Amit Kapila.





Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Robert Haas
On Tue, Apr 2, 2013 at 1:20 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
  I agree in general, though I'm not sure the bgwriter process can
  reasonably handle this need along with what it's already supposed to be
  doing.  We may need another background process that is just responsible
  for keeping the freelist populated.

  What else is the bgwriter actually doing otherwise? Sure, it doesn't put
  the pages on the freelist, but otherwise it's trying to do the above,
  isn't it?

 I think it will be problematic to tie buffer-undirtying to putting both
 clean and dirty buffers into the freelist.  It might chance to work all
 right to use one scan process for both, but I'm afraid it's more likely
 that we'd end up either overserving one goal or underserving the other.

 Hm. I had imagined that we would only ever put clean buffers into the
 freelist and that we would never write out a buffer that we don't need
 for a new page. I don't see much point in randomly writing out buffers
 that won't be needed for something else soon. Currently we can't do much
 better than basically undirtying random buffers since we don't really know
 which page will reach a usagecount of zero since bgwriter doesn't
 manipulate usagecounts.

 One other scenario I can see is the problem of strategy buffer reusage
 of dirtied pages (hint bits, pruning) during seqscans where we would
 benefit from pages being written out fast, but I can't imagine that that
 could be handled very well by something like the bgwriter?

 Am I missing something?

I've had the same thought.  I think we should consider having a
background process that listens on a queue, sort of like the fsync
absorption queue.  When a process using a buffer access strategy
dirties a buffer, it adds it to that queue and sets the latch for the
background process, which then wakes up and starts cleaning the
buffers that have been added to its queue.  The hope is that, by the
time the ring buffer wraps around, the background process will have
cleaned the buffer, preventing the foreground process from having to
wait for the buffer write (and, perhaps, xlog flush).
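
A sketch of the enqueue side of such a scheme; the shared structure, its
name, and the overflow handling (elided) are all assumptions for
illustration, with only the latch and spinlock primitives being real APIs:

    #define CLEANER_QUEUE_SIZE 1024     /* illustrative */

    /* Hypothetical shared queue for buffers dirtied via a strategy. */
    typedef struct BufCleanerShmem
    {
        Latch      *cleanerLatch;   /* latch of the background cleaner */
        slock_t     mutex;          /* protects head */
        int         head;           /* next insertion slot */
        int         requests[CLEANER_QUEUE_SIZE];   /* buffer ids to clean */
    } BufCleanerShmem;

    static void
    RequestBufferClean(BufCleanerShmem *q, int buf_id)
    {
        SpinLockAcquire(&q->mutex);
        q->requests[q->head % CLEANER_QUEUE_SIZE] = buf_id;
        q->head++;                  /* overflow handling elided */
        SpinLockRelease(&q->mutex);

        /* Wake the cleaner so the buffer is clean before the ring wraps. */
        SetLatch(q->cleanerLatch);
    }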

The main hesitation I've had about actually implementing such a scheme
is that I find it a bit unappealing to have a background process
dedicated to just this.  But maybe it could be combined with some of
the other ideas presented here.  Perhaps we should have one process
that scans the buffer arena and populates the freelists; as a side
effect, if it runs across a dirty buffer, it kicks it over to the
process described in the previous paragraph (which could still, also,
absorb requests from other backends using buffer access strategies).
Then we'd end up with nothing that looks exactly like the background
writer we have now, but maybe no one would miss it.

I think that as we go through the process of trying to improve this,
we should also look hard at trying to make the algorithms more
self-tuning.  For example, instead of having a fixed delay between
rounds for the buffer-arena-scanning process, I think we should try to
make it adaptive.  If it sticks a bunch of buffers on the freelist and
the freelist then runs dry before it wakes up again, the backend that
notices that the list is empty (or below some low watermark) should
set a latch to wake up the buffer-arena-scanning process; and
the next time that process goes back to sleep, it should sleep for a
shorter period of time.  As things are today, what the background
writer actually does is unhelpful enough that there might not be much
point in fiddling with this, but as we get to having a more sensible
scheme overall, I think it will pay dividends.
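
A sketch of that adaptive delay; the helper name (PopulateFreelist) and the
halving/backoff constants are illustrative assumptions:

    static void
    FreelistPopulatorMain(void)
    {
        long    sleep_ms = 200;     /* initial delay between rounds */

        for (;;)
        {
            int     rc;

            PopulateFreelist();     /* hypothetical: one sweep round */

            rc = WaitLatch(&MyProc->procLatch,
                           WL_LATCH_SET | WL_TIMEOUT, sleep_ms);
            ResetLatch(&MyProc->procLatch);

            if (rc & WL_LATCH_SET)
            {
                /* A backend hit the low watermark and woke us: speed up. */
                sleep_ms = Max(sleep_ms / 2, 10);
            }
            else
            {
                /* Timed out without pressure: relax gradually. */
                sleep_ms = Min(sleep_ms + 10, 1000);
            }
        }
    }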

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Greg Stark
On Wed, Apr 3, 2013 at 3:00 PM, Robert Haas robertmh...@gmail.com wrote:

 The main hesitation I've had about actually implementing such a scheme
 is that I find it a bit unappealing to have a background process
 dedicated to just this.  But maybe it could be combined with some of
 the other ideas presented here.  Perhaps we should have one process
 that scans the buffer arena and populates the freelists; as a side
 effect, if it runs across a dirty buffer, it kicks it over to the
 process described in the previous paragraph (which could still, also,
 absorb requests from other backends using buffer access strategies).
 Then we'd end up with nothing that looks exactly like the background
 writer we have now, but maybe no one would miss it.


I think the general pattern of development has led in the opposite
direction. Every time there's been one daemon responsible for two things,
it's ended up causing contorted code and difficult-to-tune behaviours, and
we've ended up splitting the two.

In particular in this case it seems like an especially poor choice. In the
time one buffer write might take, the entire freelist might empty out. I
could easily imagine this happening *every* time a write I/O happens. It
seems more likely that you'll need multiple processes running the buffer
cleaning to keep up with a decent I/O subsystem.

I'm still skeptical about the idea of a freelist. That just seems like a
terrible point of contention. However perhaps that's because I'm picturing
an LRU linked list. Perhaps the right thing is to maintain a pool of
buffers in some less contention-prone data structure which lets each
backend pick buffers out more or less independently of the others.

-- 
greg


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 I'm still skeptical about the idea of a freelist. That just seems like a
 terrible point of contention. However perhaps that's because I'm picturing
 an LRU linked list. Perhaps the right thing is to maintain a pool of
 buffers in some less contention-prone data structure which lets each
 backend pick buffers out more or less independently of the others.

I think the original vision of the clock sweep algorithm included the
idea that different backends could be running the sweep over different
parts of the buffer ring concurrently.  If we could get rid of the need
to serialize that activity, it'd pretty much eliminate the bottleneck
I should think.  The problem is how to manage it to ensure that (a)
backends aren't actually contending on the same buffers as they do this,
and (b) there's a reasonably fair rate of usage_count decrements for
each buffer, rather than possibly everybody ganging up on the same area
sometimes.  Thoughts?

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Ants Aasma
On Thu, Apr 4, 2013 at 3:41 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think the original vision of the clock sweep algorithm included the
 idea that different backends could be running the sweep over different
 parts of the buffer ring concurrently.  If we could get rid of the need
 to serialize that activity, it'd pretty much eliminate the bottleneck
 I should think.  The problem is how to manage it to ensure that (a)
 backends aren't actually contending on the same buffers as they do this,
 and (b) there's a reasonably fair rate of usage_count decrements for
 each buffer, rather than possibly everybody ganging up on the same area
 sometimes.  Thoughts?

Fairness and avoiding each other imply that some coordination is
required. One wild idea I had is to partition the buffers into N
partitions, each with its own clock sweep protected by an lwlock.

To reduce contention, the clocksweep runners use something like
sched_getcpu() to determine which partition to use to find their
buffer. Using the CPU to determine the partition makes it necessary
for the process to be scheduled out in the critical section for some
other backend to contend on the lock. And if some backend does contend
on it, it is likely to reside on the same CPU and by sleeping will
make room for the lockholder to run.

To ensure fairness for buffers, every time one of the clocksweeps
wraps around, a global offset counter is incremented. This re-assigns
all cpus/backends to the next partition, sort of like the mad hatter's
tea party.
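
A sketch of that partition choice; the partition count, the sweepOffset
field, and the fallback are illustrative assumptions (sched_getcpu() is
glibc-specific, from <sched.h> with _GNU_SOURCE):

    #define N_SWEEP_PARTITIONS 16

    /* Pick the clock-sweep partition for this backend's next allocation. */
    static int
    MySweepPartition(void)
    {
        int     cpu = sched_getcpu();   /* may return -1 on error */

        /* Torn/stale reads of the shared offset are harmless here. */
        uint32  offset = StrategyControl->sweepOffset;  /* hypothetical */

        if (cpu < 0)
            cpu = MyProcPid;            /* fall back to something stable */

        return (cpu + offset) % N_SWEEP_PARTITIONS;
    }

    /* Called by whichever sweep wraps around: shifts everyone over by one. */
    static void
    AdvanceSweepOffset(void)
    {
        StrategyControl->sweepOffset++; /* exactness not required */
    }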

The scenario that I'm most worried about here is what happens when a
process holding the clocksweep lock is migrated to another CPU and
then scheduled out. The processes on the original CPU will sooner or
later block behind the lock while the processes on the CPU where the
lock holder now waits keep hogging the CPU. This will create an
imbalance that the scheduler might try to correct, possibly creating a
nasty feedback loop. It could be that in practice the scenario works
out to be too far fetched to matter, but who knows.

I don't have a slightest idea yet how the background writer would
function in this environment. But if redesigning the bgwriter
mechanism was on the table I thought I would throw the idea out here.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-03 Thread Greg Smith

On 4/2/13 11:54 AM, Robert Haas wrote:

But, having said that, I still think the best idea is what Andres
proposed, which pretty much matches my own thoughts: the bgwriter
needs to populate the free list, so that buffer allocations don't have
to wait for linear scans of the buffer array.


I was hoping this one would make it to a full six years of being on the 
TODO list before it came up again, missed it by a few weeks.  The 
funniest part is that Amit even submitted a patch on this theme a few 
months ago without much feedback: 
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs 
 That stalled where a few things have, on a) needing more regression 
test workloads, and b) wondering just what the deal with large 
shared_buffers setting degrading performance was.


I saw refactoring in this area as waiting behind it being easier to 
experiment with adding new processes, but that barrier has fallen now. 
Maybe it needs a new freelist process, maybe it doesn't; today the code 
needed to try both is relatively cheap.


The other thing that always seemed to stop me was never having a typical 
Linux system big enough to hit some of these problems available all the 
time.  What I did this week on that front was just go buy a 24 core 
server with 64GB of RAM that lives in my house.  I just need to keep it 
two floors away if I want to sleep at night.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
 On 3/23/13 7:41 AM, Ants Aasma wrote:
 Yes, having bgwriter do the actual cleaning up seems like a good idea.
 The whole bgwriter infrastructure will need some serious tuning. There
 are many things that could be shifted to background if we knew it
 could keep up, like hint bit setting on dirty buffers being flushed
 out. But again, we have the issue of having good tests to see where
 the changes hurt.
 
 I think at some point we need to stop depending on just bgwriter for all this 
 stuff. I believe it would be much cleaner if we had separate procs for 
 everything we needed (although some synergies might exist; if we wanted to 
 set hint bits during write then bgwriter *is* the logical place to put that).
 
 In this case, I don't think keeping stuff on the free list is close enough to 
 checkpoints that we'd want bgwriter to handle both. At most we might want 
 them to pass some metrics back and forth.

bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.

In my personal experience and measurement, bgwriter is pretty close to
useless right now. I think - pretty similar to what Amit has done - it
should perform part of a real clock sweep (instead of just looking ahead
of the current position without changing usagecounts or the sweep
position) and put enough buffers on the freelist to sustain the need till
its next activity phase. I hacked around that one night in a hotel and
got impressive speedups (and quite some breakage) for bigger-than-s_b
workloads.

That would reduce quite a few pain points:
- fewer different processes/cpus looking at buffer headers ahead in the cycle
- fewer cpus changing usagecounts
- dirty pages are far more likely to be flushed out already when a new
  page is needed
- stuff like the relation extension lock, which right now frequently has
  to search far and wide for new pages while holding the extension lock
  exclusively, should finish quite a bit faster

If the freelist lock is separated from the lock protecting the clock
sweep this should get quite a bit of a scalability boost without having
potential unfairness you can have with partitioning the lock or such.
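
A sketch of what that bgwriter-side sweep could look like; ClockSweepNext(),
the target logic, and the use of FlushBuffer are assumptions here, and
buffer-header locking is elided:

    /*
     * Illustrative only: run the real clock sweep from the bgwriter,
     * decrementing usage counts and pushing clean, unused buffers onto
     * the freelist until a target count is met or we give up.
     */
    static void
    RefillFreelist(int target)
    {
        int     added = 0;
        int     scanned = 0;

        while (added < target && scanned++ < NBuffers)
        {
            volatile BufferDesc *buf = ClockSweepNext();  /* moves the hand */

            if (buf->refcount != 0)
                continue;           /* pinned: leave it alone */

            if (buf->usage_count > 0)
                buf->usage_count--; /* the real sweep decrements */
            else
            {
                if (buf->flags & BM_DIRTY)
                    FlushBuffer(buf, NULL);     /* write it out first */
                StrategyFreeBuffer(buf);        /* clean and unused */
                added++;
            }
        }
    }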

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 6:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-01 17:56:19 -0500, Jim Nasby wrote:
 On 3/23/13 7:41 AM, Ants Aasma wrote:
 Yes, having bgwriter do the actual cleaning up seems like a good idea.
 The whole bgwriter infrastructure will need some serious tuning. There
 are many things that could be shifted to background if we knew it
 could keep up, like hint bit setting on dirty buffers being flushed
 out. But again, we have the issue of having good tests to see where
 the changes hurt.

 I think at some point we need to stop depending on just bgwriter for all 
 this stuff. I believe it would be much cleaner if we had separate procs for 
 everything we needed (although some synergies might exist; if we wanted to 
 set hint bits during write then bgwriter *is* the logical place to put that).

 In this case, I don't think keeping stuff on the free list is close enough 
 to checkpoints that we'd want bgwriter to handle both. At most we might want 
  them to pass some metrics back and forth.

 bgwriter isn't doing checkpoints anymore, there's the checkpointer since 9.2.

 In my personal experience and measurement, bgwriter is pretty close to
 useless right now. I think - pretty similar to what Amit has done - it
 should perform part of a real clock sweep (instead of just looking ahead
 of the current position without changing usagecounts or the sweep
 position) and put enough buffers on the freelist to sustain the need till
 its next activity phase. I hacked around that one night in a hotel and
 got impressive speedups (and quite some breakage) for bigger-than-s_b
 workloads.

 That would reduce quite a few pain points:
 - fewer different processes/cpus looking at buffer headers ahead in the cycle
 - fewer cpus changing usagecounts
 - dirty pages are far more likely to be flushed out already when a new
   page is needed
 - stuff like the relation extension lock, which right now frequently has
   to search far and wide for new pages while holding the extension lock
   exclusively, should finish quite a bit faster

 If the freelist lock is separated from the lock protecting the clock
 sweep this should get quite a bit of a scalability boost without having
 potential unfairness you can have with partitioning the lock or such.

I agree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That seems pretty unlikely because of (a) the sheer luck of hitting that
 page for the dropout (if your buffer count is N, the chances of losing
 it would seem to be 1/N at most) and (b) highly used pages are much more
 likely to be pinned and thus immune from eviction.  But my issue with
 this whole line of analysis is that I've never been able to turn it
 up in simulated testing.  Probably to do it you'd need very, very fast
 storage.

Well, if you have shared_buffers=8GB, that's a million buffers.  One
in a million events happen pretty frequently on a heavily loaded
server, which, on recent versions of PostgreSQL, can support several
hundred thousand queries per second, each of which accesses multiple
buffers.

I've definitely seen evidence that poor choices of which CLOG buffer
to evict can result in a noticeable system-wide stall while everyone
waits for it to be read back in.  I don't have any similar evidence
for shared buffers, but I wouldn't be very surprised if the same
danger exists there, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Merlin Moncure
On Tue, Apr 2, 2013 at 9:55 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 1:53 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That seems pretty unlikely because of (a) the sheer luck of hitting that
 page for the dropout (if your buffer count is N, the chances of losing
 it would seem to be 1/N at most) and (b) highly used pages are much more
 likely to be pinned and thus immune from eviction.  But my issue with
 this whole line of analysis is that I've never been able to turn it
 up in simulated testing.  Probably to do it you'd need very, very fast
 storage.

 Well, if you have shared_buffers=8GB, that's a million buffers.  One
 in a million events happen pretty frequently on a heavily loaded
 server, which, on recent versions of PostgreSQL, can support several
 hundred thousand queries per second, each of which accesses multiple
 buffers.

 I've definitely seen evidence that poor choices of which CLOG buffer
 to evict can result in a noticeable system-wide stall while everyone
 waits for it to be read back in.  I don't have any similar evidence
 for shared buffers, but I wouldn't be very surprised if the same
 danger exists there, too.

That's a very fair point, although not being able to evict pinned
buffers is a highly mitigating aspect.  Also, CLOG is a different beast
entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
can hold a lot of high-priority things.  But you could be right anyway.

Given that, I wouldn't feel very comfortable with forced eviction
without knowing for sure high priority buffers were immune from that.
Your nailing idea is maybe the ideal solution.   Messing around with
the usage_count mechanic is tempting (like raising the cap and making
the sweeper more aggressive as it iterates), but probably really
difficult to get right, and, hopefully, ultimately moot.

 merlin




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Robert Haas
On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That's a very fair point, although not being able to evict pinned
 buffers is a highly mitigating aspect.  Also, CLOG is a different beast
 entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
 can hold a lot of high-priority things.  But you could be right anyway.

 Given that, I wouldn't feel very comfortable with forced eviction
 without knowing for sure high priority buffers were immune from that.
 Your nailing idea is maybe the ideal solution.   Messing around with
 the usage_count mechanic is tempting (like raising the cap and making
 the sweeper more aggressive as it iterates), but probably really
 difficult to get right, and, hopefully, ultimately moot.

One thought I had for fiddling with usage_count is to make it grow
additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
sure the idea is any good, but one problem with the current system is
that it's pretty trivial for a buffer to accumulate five touches, and
after that we lose all memory of what the frequency of access is, so
pages of widely varying levels of hotness can all have usage count 5.
This might allow a little more refinement without letting the time to
degrade the usage count get out of control.
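
In code form, the idea is just the following (the cap value is an
assumption; today both paths clamp at 5):

    #define MAX_USAGE_COUNT 64      /* illustrative cap */

    /* On buffer access: grow additively (today: usage_count++, capped). */
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;

    /* On each clock-sweep pass: decay exponentially (today: usage_count--). */
    buf->usage_count >>= 1;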

But, having said that, I still think the best idea is what Andres
proposed, which pretty much matches my own thoughts: the bgwriter
needs to populate the free list, so that buffer allocations don't have
to wait for linear scans of the buffer array.  That's just plain too
slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.  That's just plain too
 slow.

I agree in general, though I'm not sure the bgwriter process can
reasonably handle this need along with what it's already supposed to be
doing.  We may need another background process that is just responsible
for keeping the freelist populated.

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 11:54:32 -0400, Robert Haas wrote:
 On Tue, Apr 2, 2013 at 11:32 AM, Merlin Moncure mmonc...@gmail.com wrote:
  That's a very fair point, although not being able to evict pinned
  buffers is a highly mitigating aspect.  Also, CLOG is a different beast
  entirely -- it's much more dense (2 bits!) vs a tuple, so a single page
  can hold a lot of high-priority things.  But you could be right anyway.
 
  Given that, I wouldn't feel very comfortable with forced eviction
  without knowing for sure high priority buffers were immune from that.
  Your nailing idea is maybe the ideal solution.   Messing around with
  the usage_count mechanic is tempting (like raising the cap and making
  the sweeper more aggressive as it iterates), but probably really
  difficult to get right, and, hopefully, ultimately moot.
 
 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.
 
 But, having said that, I still think the best idea is what Andres
 proposed, which pretty much matches my own thoughts: the bgwriter
 needs to populate the free list, so that buffer allocations don't have
 to wait for linear scans of the buffer array.  That's just plain too
 slow.

That way the usagecount should go down far more slowly, which essentially
makes it more granular. And I think fiddling on that level before we have
a more sensible buffer acquisition implementation is pretty premature,
since that will change too much.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  But, having said that, I still think the best idea is what Andres
  proposed, which pretty much matches my own thoughts: the bgwriter
  needs to populate the free list, so that buffer allocations don't have
  to wait for linear scans of the buffer array.  That's just plain too
  slow.
 
 I agree in general, though I'm not sure the bgwriter process can
 reasonably handle this need along with what it's already supposed to be
 doing.  We may need another background process that is just responsible
 for keeping the freelist populated.

What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
pages on the freelist, but otherwise it's trying to do the above, isn't it?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
 I agree in general, though I'm not sure the bgwriter process can
 reasonably handle this need along with what it's already supposed to be
 doing.  We may need another background process that is just responsible
 for keeping the freelist populated.

 What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
 pages on the freelist, but otherwise it's trying to do the above, isn't it?

I think it will be problematic to tie buffer-undirtying to putting both
clean and dirty buffers into the freelist.  It might chance to work all
right to use one scan process for both, but I'm afraid it's more likely
that we'd end up either overserving one goal or underserving the other.

Also note the entire design of BgBufferSync right now is predicated on
the assumption that the rate of motion of the scan strategy point
reflects the rate at which new buffers are needed.  If backends are
supposed to always get new buffers from the freelist, that logic becomes
circular: the bgwriter would be watching itself.  Perhaps we can
refactor the feedback control loop logic so that the buffer scan rate is
driven by changes in the length of the freelist, but I'm not sure
exactly what the consequences would be.

regards, tom lane




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 12:56:56 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-04-02 12:22:03 -0400, Tom Lane wrote:
  I agree in general, though I'm not sure the bgwriter process can
  reasonably handle this need along with what it's already supposed to be
  doing.  We may need another background process that is just responsible
  for keeping the freelist populated.
 
  What else is the bgwriter actually doing otherwise? Sure, it doesn't put the
  pages on the freelist, but otherwise it's trying to do the above, isn't it?
 
 I think it will be problematic to tie buffer-undirtying to putting both
 clean and dirty buffers into the freelist.  It might chance to work all
 right to use one scan process for both, but I'm afraid it's more likely
 that we'd end up either overserving one goal or underserving the other.

Hm. I had imagined that we would only ever put clean buffers into the
freelist and that we would never write out a buffer that we don't need
for a new page. I don't see much point in randomly writing out buffers
that won't be needed for something else soon. Currently we can't do much
better than basically undirtying random buffers since we don't really know
which page will reach a usagecount of zero since bgwriter doesn't
manipulate usagecounts.

One other scenario I can see is the problem of strategy buffer reusage
of dirtied pages (hint bits, pruning) during seqscans where we would
benefit from pages being written out fast, but I can't imagine that that
could be handled very well by something like the bgwriter?

Am I missing something?

 Also note the entire design of BgBufferSync right now is predicated on
 the assumption that the rate of motion of the scan strategy point
 reflects the rate at which new buffers are needed.  If backends are
 supposed to always get new buffers from the freelist, that logic becomes
 circular: the bgwriter would be watching itself.  Perhaps we can
 refactor the feedback control loop logic so that the buffer scan rate is
 driven by changes in the length of the freelist, but I'm not sure
 exactly what the consequences would be.

Yea, that will definitely need refactoring. What I am imagining is that
the pacing is basically built on top of a few StrategyControl variables
like:
* number of buffers from the freelist
* number of buffers acquired by backend because freelist was empty
* number of buffers written out by backend because freelist was empty

Those should be pretty cheap to maintain and should be enough to
implement sensible pacing for bgwriter.
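
In rough form (field names are illustrative; rough, unlocked counts would
suffice for pacing):

    /* Sketch of pacing counters added to the strategy control area. */
    typedef struct StrategyPacingCounters
    {
        uint32      freelist_allocs;    /* handed out from the freelist */
        uint32      sweep_allocs;       /* list empty: backend ran sweep */
        uint32      backend_writes;     /* list empty: backend wrote page */
    } StrategyPacingCounters;

The bgwriter would read the deltas each cycle and scale its sweep rate
accordingly.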

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Greg Stark
I'm confused by this thread. We *used* to maintain an LRU. The whole
reason for the clock-sweep algorithm is precisely to avoid maintaining
a linked list of least recently used buffers since the head of that
list is a point of contention.




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Andres Freund
On 2013-04-02 18:26:23 +0100, Greg Stark wrote:
 I'm confused by this thread. We *used* to maintain an LRU. The whole
 reason for the clock-sweep algorithm is precisely to avoid maintaining
 a linked list of least recently used buffers since the head of that
 list is a point of contention.

I don't think anybody is proposing to put the LRU back into a linked list;
given the frequency of usagecount manipulations, that would probably end
pretty badly. What I think Robert, Tom and I are talking about is putting
*some* buffers with usagecount = 0 into a linked list so that when a backend
requires a new page it can take one buffer from the freelist instead of

a) possibly touching quite some pages (I have seen 4 times *every* existing
header) to find one with usagecount = 0
b) having to write the page out itself

If everything is going well, that would mean only the bgwriter (or the
bgfreelist or whatever) performs the clock sweep. Others take *new* pages
from the freelist instead of performing part of the sweep themselves.

Makes more sense?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Atri Sharma
On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:

 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.

This is just off the top of my head, but one possible solution could
be to quantize the levels of hotness. Specifically, we could
categorize buffers based on hotness. All buffers start in level 1 with
usage_count 0. When a buffer has reached a usage_count of 5, and the
next increment of its usage_count would take it above 5, we instead
promote the buffer to the next level and reset its usage_count to 0.
The same logic applies at each level. When we decrement usage_count
and see that it is zero (for some buffer), if it is in a level > 1, we
demote the buffer to the next lower level. If the buffer is in level
1, it is a potential candidate for replacement.

This will allow us to keep a loose idea of the hotness of a page
without storing an unbounded usage_count for a buffer. We can still
update usage_count without locking, as buffers in high contention
which miss an update to their usage_count won't be affected by that
missed update, in accordance with all the discussion upthread.
Thoughts/Comments?

Regards,

Atri


--
Regards,

Atri
l'apprenant




Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Merlin Moncure
On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:

 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of widely varying levels of hotness can all have usage count 5.
 This might allow a little more refinement without letting the time to
 degrade the usage count get out of control.

 This is just off the top of my head, but one possible solution could
 be to quantize the levels of hotness. Specifically, we could
 categorize buffers based on hotness. All buffers start in level 1 with
 usage_count 0. When a buffer has reached a usage_count of 5, and the
 next increment of its usage_count would take it above 5, we instead
 promote the buffer to the next level and reset its usage_count to 0.
 The same logic applies at each level. When we decrement usage_count
 and see that it is zero (for some buffer), if it is in a level > 1, we
 demote the buffer to the next lower level. If the buffer is in level
 1, it is a potential candidate for replacement.

 This will allow us to keep a loose idea of the hotness of a page
 without storing an unbounded usage_count for a buffer. We can still
 update usage_count without locking, as buffers in high contention
 which miss an update to their usage_count won't be affected by that
 missed update, in accordance with all the discussion upthread.

how is that different from usage_count itself? usage_count *is* a
measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
already considerable damage that occurs in the situation Andres is
talking about (where everything is marked 'hot' so you have to sweep a
lot more).

also, any added complexity in terms of manipulating usage_count is a
move away from the lockless maintenance I'm proposing.  maybe my idea
is a non-starter on that basis alone, but the mechanic should be kept
as simple as possible.  the idea to move it to the bgwriter is to
pre-emptively do the work that backends are now doing: try and keep
ahead of the allocations being done so that buffer requests are
satisfied quickly.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-02 Thread Atri Sharma


Sent from my iPad

On 02-Apr-2013, at 23:41, Merlin Moncure mmonc...@gmail.com wrote:

 On Tue, Apr 2, 2013 at 12:50 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Tue, Apr 2, 2013 at 9:24 PM, Robert Haas robertmh...@gmail.com wrote:
 
 One thought I had for fiddling with usage_count is to make it grow
 additively (x = x + 1) and decay exponentially (x = x >> 1).  I'm not
 sure the idea is any good, but one problem with the current system is
 that it's pretty trivial for a buffer to accumulate five touches, and
 after that we lose all memory of what the frequency of access is, so
 pages of varying levels of hotness can all have usage
 count 5.  This might allow a little more refinement without letting
 the time to degrade the usage count get out of control.
 
 This is just off the top of my head, but one possible solution could
 be to quantize the levels of hotness. Specifically, we could
 categorize buffers based on hotness. All buffers start in level 1 and
 usage_count 0. When a buffer reaches a usage_count of 5, the next
 clock sweep that wants to increment its usage_count (hence taking it
 above 5) instead promotes the buffer to the next level and resets its
 usage_count to 0. The same logic applies at each level. When we
 decrement usage_count and see that it is zero (for some buffer), if
 it is in a level > 1, we demote the buffer to the next lower level. If
 the buffer is in level 1, it is a potential candidate for replacement.
 
 This will allow us to have a loose idea about the hotness of a page,
 without actually storing the usage_count for a buffer. We can still
 update usage_count without locking, as buffers in high contention
 which miss an update in their usage_count won't be affected by that
 missed update, in accordance with all the discussion upthread.
 
 how is that different from usage_count itself? usage_count *is* a
 measure of hotness.  the arbitrary cap at 5 is paranoia to prevent the
 already considerable damage that occurs in the situation Andres is
 talking about (where everything is marked 'hot' so you have to sweep a
 lot more).
 
 also, any added complexity in terms of manipulating usage_count is a
 move away from the lockless maintenance I'm proposing.  maybe my idea
 is a non-starter on that basis alone, but the mechanic should be kept
 as simple as possible.  the idea to move it to the bgwriter is to
 pre-emptively do the work that backends are now doing: try and keep
 ahead of the allocations being done so that buffer requests are
 satisfied quickly.
 

I agree, we want to reduce the complexity of usage_count. I was only musing on 
the point Robert raised: whether we want to continue using usage_count and 
refine it for more accurate tracking of the hotness of a buffer.

Regards,

Atri

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
  logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

usage_count manipulation is not a bottleneck but that is irrelevant.
It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
sitting and spinning while holding the BufFreelistLock is a good idea.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Robert Haas
On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

Hmm.  I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
I think a big part of the reason why BufFreelistLock contention is
such a big problem is that we do other operations that involve atomics
while we're holding that lock.  You can have contention on a lock even
if you just take it, do some stuff, and release it.  But the longer
you hold the lock for, the less concurrency you need to have in order
to get a contention problem.  And atomics take a LOT longer to execute
than regular instructions - so it seems to me that usage_count
manipulation is relevant not so much because we get contention on the
buffer spinlocks as because it means we're sitting there serially
taking and releasing locks while sitting on BufFreelistLock.

In fact, BufFreelistLock is really misnamed, because for the most
part, the free list as we implement it is going to be empty.  What the
BufFreelistLock is really doing is serializing the process of scanning
for a free buffer.  I think THAT is the problem.  If we could arrange
things so as to hold BufFreelistLock only for the amount of time
needed to remove a buffer from a freelist ... we'd probably buy
ourselves quite a bit.
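
In outline that would shrink the critical section to something like the
sketch below (uncompiled; field names as in today's freelist code, with the
clock sweep happening entirely outside the lock):

    volatile BufferDesc *buf = NULL;

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        /* pop the head of the freelist; O(1) while holding the lock */
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    LWLockRelease(BufFreelistLock);

    /* if buf is still NULL, fall back to scanning outside the lock */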

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 10:03 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Apr 1, 2013 at 9:28 AM, Merlin Moncure mmonc...@gmail.com wrote:
 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 Hmm.  I'm not sure who, if anyone, I'm agreeing or disagreeing with, but
 I think a big part of the reason why BufFreelistLock contention is
 such a big problem is that we do other operations that involve atomics
 while we're holding that lock.  You can have contention on a lock even
 if you just take it, do some stuff, and release it.  But the longer
 you hold the lock for, the less concurrency you need to have in order
 to get a contention problem.  And atomics take a LOT longer to execute
 than regular instructions - so it seems to me that usage_count
 manipulation is relevant not so much because we get contention on the
 buffer spinlocks as because it means we're sitting there serially
 taking and releasing locks while sitting on BufFreelistLock.

 In fact, BufFreelistLock is really misnamed, because for the most
 part, the free list as we implement it is going to be empty.  What the
 BufFreelistLock is really doing is serializing the process of scanning
 for a free buffer.  I think THAT is the problem.  If we could arrange
 things so as to hold BufFreelistLock only for the amount of time
 needed to remove a buffer from a freelist ... we'd probably buy
 ourselves quite a bit.

right.  I'm imagining a buffer scan loop that looks something like
(uncompiled, untested) this.  TryLockBufHdr does a simple TAS
without spin, returning the lock state (well, true if it acquired the
lock).  usage_count is specifically and deliberately adjusted without
having a lock on the buffer header (this would require some careful
testing and possible changes elsewhere):

  volatile BufferDesc *buf;
  int trycounter = NBuffers;  /* give up after a full pass with no progress */

  for (;;)
  {
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
      StrategyControl->nextVictimBuffer = 0;
      StrategyControl->completePasses++;
    }

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; decrement the usage_count (unless pinned) and keep scanning.
     */
    if (buf->refcount == 0)
    {
      if (buf->usage_count > 0)
      {
        buf->usage_count--;
        trycounter = NBuffers;
      }
      else
      {
        if (TryLockBufHdr(buf))
        {
          /* Found a usable buffer */
          if (strategy != NULL)
            AddBufferToRing(strategy, buf);
          return buf;
        }
      }
    }

    if (--trycounter == 0)
    {
      /*
       * We've scanned all the buffers without making any state changes,
       * so all the buffers are pinned (or were when we looked at them).
       * We could hope that someone will free one eventually, but it's
       * probably better to fail than to risk getting stuck in an
       * infinite loop.
       */
      elog(ERROR, "no unpinned buffers available");
    }
  }
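
For reference, TryLockBufHdr might look like the sketch below, assuming the
buffer header spinlock is the slock_t buf_hdr_lock field and the s_lock.h
convention that TAS() returns zero when the lock is acquired (again,
uncompiled and untested):

    static bool
    TryLockBufHdr(volatile BufferDesc *buf)
    {
        /* unlocked peek first: a busy header is skipped with no bus lock */
        if (buf->buf_hdr_lock)
            return false;

        /* one hardware TAS, no spinning; caller moves on if this fails */
        return TAS((slock_t *) &buf->buf_hdr_lock) == 0;
    }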

The advantages are:
*) no spinning under the free list lock (ever)
*) consequently pretty much zero chance of getting scheduled out
*) far fewer (in some cases drastically fewer) atomic operations unless
the sweep is generally returning a buffer immediately upon every
request, in which case it's about the same
*) fewer CPU cycles overall, unless you somehow miss a whole bunch of
buffers that claim to be locked when in fact they are not (which seems
improbable)

I think you could implement something approximating the above in
conjunction with your buffer nailing, or without (although your stuff
would likely reduce the number of cases where you'd really need it).
Ditto Jeff J's idea to completely replace BufFreelistLock with a
spinlock implementation.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Bruce Momjian
On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
  In fact, BufFreelistLock is really misnamed, because for the most
  part, the free list as we implement it is going to be empty.  What the
  BufFreelistLock is really doing is serializing the process of scanning
  for a free buffer.  I think THAT is the problem.  If we could arrange
  things so as to hold BufFreelistLock only for the amount of time
  needed to remove a buffer from a freelist ... we'd probably buy
  ourselves quite a bit.
 
 right.  I'm imagining a buffer scan loop that looks something like
 (uncompiled, untested) this.  TryLockBufHdr does a simple TAS
 without spin, returning the lock state (well, true if it acquired the
 lock).  usage_count is specifically and deliberately adjusted without
 having a lock on the buffer header (this would require some careful
 testing and possible changes elsewhere):

TAS does a CPU 'lock' instruction, which affects the CPU cache.  Why not
just read the value with no lock?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 3:32 PM, Bruce Momjian br...@momjian.us wrote:
 On Mon, Apr  1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
  In fact, BufFreelistLock is really misnamed, because for the most
  part, the free list as we implement it is going to be empty.  What the
  BufFreelistLock is really doing is serializing the process of scanning
  for a free buffer.  I think THAT is the problem.  If we could arrange
  things so as to hold BufFreelistLock only for the amount of time
  needed to remove a buffer from a freelist ... we'd probably buy
  ourselves quite a bit.

 right.  I'm imagining a buffer scan loop that looks something like
 (uncompiled, untested) this.  TryLockBufHdr does a simple TAS
 without spin, returning the lock state (well, true if it acquired the
 lock).  usage_count is specifically and deliberately adjusted without
 having a lock on the buffer header (this would require some careful
 testing and possible changes elsewhere):

 TAS does a CPU 'lock' instruction which affects the cpu cache.  Why not
 just read the value with no lock?

check again, that's exactly what it does. Note the old implementation
did a LockBufHdr() before examining refcount.  The key logic is here:


if (buf->refcount == 0)
{
  if (buf->usage_count > 0)
  {
    buf->usage_count--;
    trycounter = NBuffers;
  }
  else
  {
    if (TryLockBufHdr(buf))
    {

So we do an unlocked read of refcount and immediately bail if the
buffer is locked according to the CPU cache.  Then we check (still
unlocked) usage_count and decrement it:  Our adjustment may be lost,
but so what?  Finally, we attempt one (and only one) cache line lock
(via TAS_SPIN) of the buffer and again bail if we see any problems
there.   Thus, it's impossible to get stuck in a potentially yielding
spin while holding the free list lock.

I dub this: The Frightened Turtle strategy of buffer allocation.
It's an idea based on my research trying to solve Vlad's issue of
having server stalls during read-only loads (see
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html
for a general backgrounder).  The idea may not actually fix his issue,
or there may be other aggravating aspects such as the
always-capricious linux scheduler, but I'm suspicious.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Andres Freund
On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Friday, March 22, 2013, Ants Aasma wrote:
 
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:
   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.
  
   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).
 
  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.
 
 
  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)
 
 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

In my experience the mere fact of (unlockedly, but still) accessing all the
buffer headers can cause noticeable slowdowns in write only/mostly workloads
with big amounts of shmem.
Due to the write-only nature, large numbers of buffers have similar
usagecounts (since they are infrequently touched after the initial insertion)
and there are no free ones around, so the search for a buffer frequently runs
through *all* buffer headers multiple times until it has decremented all
usagecounts to 0. Then comes a period where free buffers are found easily
(since all usagecounts from the current sweep point onwards are zero). After
that it starts all over.
I have now seen that scenario multiple times :(

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Friday, March 22, 2013, Ants Aasma wrote:
 
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:
   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.
  
   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).
 
  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.
 
 
  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

Interesting -- I was thinking about that too, but it's a separate
problem with a different trigger.  Maybe a bailout should be in there
so that after X usage_count adjustments the sweeper summarily does an
eviction, or maybe the max declines from 5 once per hundred buffers
inspected or some such.
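
In sketch form that bailout might look like this, inside the sweep loop;
'inspected', 'threshold', and NextVictimBuffer() are hypothetical, not
existing code:

    int inspected = 0;
    int threshold = BM_MAX_USAGE_COUNT;     /* 5 today */

    for (;;)
    {
        volatile BufferDesc *buf = NextVictimBuffer();  /* hypothetical */

        /* every 100 buffers inspected, lower the bar so a sweep through
         * uniformly warm buffers terminates quickly */
        if (++inspected % 100 == 0 && threshold > 1)
            threshold--;

        if (buf->refcount == 0 && buf->usage_count < threshold)
            return buf;                     /* summary eviction candidate */

        if (buf->usage_count > 0)
            buf->usage_count--;
    }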

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 3/23/13 7:41 AM, Ants Aasma wrote:

On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby j...@nasby.net wrote:

Partitioned clock sweep strikes me as a bad idea... you could certainly get
unlucky and end up with a lot of hot stuff in one partition.


Surely that is not worse than having everything in a single partition.
Given a decent partitioning function it's very highly unlikely to have
more than a few of the hottest buffers end up in a single partition.


One could argue that it is worse because you've added another layer of 
unpredictability to performance. If something happens to suddenly put two 
heavily hit sets in the same partition your previously good performance 
suddenly tanks.

Maybe that issue isn't real enough to be worth worrying about, but I still 
think it'd be easier and cleaner to try keeping stuff on the free list first...


Another idea that's been brought up in the past is to have something in the
background keep a minimum number of buffers on the free list. That's how the OS
VM systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I
don't remember if anyone actually tested the idea.


Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.


I think at some point we need to stop depending on just bgwriter for all this 
stuff. I believe it would be much cleaner if we had separate procs for 
everything we needed (although some synergies might exist; if we wanted to set 
hint bits during write then bgwriter *is* the logical place to put that).

In this case, I don't think keeping stuff on the free list is close enough to 
checkpoints that we'd want bgwriter to handle both. At most we might want them 
to pass some metrics back and forth.
--
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 4/1/13 4:55 PM, Merlin Moncure wrote:

 On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:

 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:

  On Friday, March 22, 2013, Ants Aasma wrote:

  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:

   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.

   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.

  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

Interesting -- I was thinking about that too, but it's a separate
problem with a different trigger.  Maybe a bailout should be in there
so that after X usage_count adjustments the sweeper summarily does an
eviction, or maybe the max declines from 5 once per hundred buffers
inspected or some such.


What's the potential downside on that though? IE: what happens if this scheme 
suddenly evicts the root page on a heavily used index? You'll suddenly have a 
ton of stuff blocked waiting for that page to come back in.

This is a use case that I think would benefit greatly from a background process 
that keeps pages in the free list.

That said, I now suspect that your frightened turtle approach would be of higher value 
than bgfreelist... but I suspect we'll ultimately want both of them for different 
reasons.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Jim Nasby

On 3/23/13 4:43 AM, Amit Kapila wrote:

I have tried one of the ideas: adding the buffers the background writer finds
reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
This can reduce the clock sweep as it can find buffers from the freelist.


That's a nice potential efficiency gain, but it's not the same as having a separate bg 
process charged with keeping pages on the freelist. I believe a separate process would be 
useful in a wider variety of workloads, because it's not dependent on stumbling across 0 
count blocks; it would actively work to produce zero count blocks when none 
existed and then free-list them.
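
In rough outline, such a process's main loop might look like the sketch
below; FreeListLength(), SweepOneBuffer(), and freelist_target are all
hypothetical, the point is only the division of labor:

    for (;;)
    {
        /* top up the freelist before backends must sweep for themselves */
        while (FreeListLength() < freelist_target)
        {
            volatile BufferDesc *buf = SweepOneBuffer(); /* advance the clock hand */

            if (buf != NULL)
                StrategyFreeBuffer(buf);    /* push onto the shared freelist */
        }
        pg_usleep(10000);                   /* nap 10ms, then re-check */
    }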


It shows performance improvement for read loads when data can be contained
in shared buffers,
but when the data becomes large and I/O is involved, it shows some dip as
well.


Do you remember off-hand why it slowed down with I/O so I don't have to read 
the whole thread? :) Was it just a matter of it evicting dirty pages sooner 
than it would otherwise?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Atri Sharma
  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

I completely agree. The idea of spinning for a lock while already
holding another lock seems like a recipe for a performance hit.

Regards,

Atri




--
Regards,

Atri
l'apprenant

On Mon, Apr 1, 2013 at 6:58 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


 That is one of multiple issues.  Contention on the BufFreelistLock is
 another one.  I agree that usage_count maintenance is unlikely to become a
 bottleneck unless one or both of those is fixed first (and maybe not even
 then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
  inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 merlin



-- 
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Merlin Moncure
On Mon, Apr 1, 2013 at 6:09 PM, Jim Nasby j...@nasby.net wrote:
 On 4/1/13 4:55 PM, Merlin Moncure wrote:

 On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:

 On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes jeff.ja...@gmail.com wrote:

  On Friday, March 22, 2013, Ants Aasma wrote:

  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
  wrote:

   well if you do a non-locking test first you could at least avoid some
   cases (and, if you get the answer wrong, so what?) by jumping to the
   next buffer immediately.  if the non locking test comes good, only
   then do you do a hardware TAS.

   you could in fact go further and dispense with all locking in front of
   usage_count, on the premise that it's only advisory and not a real
   refcount.  so you only then lock if/when it's time to select a
   candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

  Moreover, if the buffer happens to miss a decrement due to a data
  race, there's a good chance that the buffer is heavily used and
  wouldn't need to be evicted soon anyway. (if you arrange it to be a
  read-test-inc/dec-store operation then you will never go out of
  bounds) However, clocksweep and usage_count maintenance is not what is
  causing contention because that workload is distributed. The issue is
  pinning and unpinning.

  That is one of multiple issues.  Contention on the BufFreelistLock is
  another one.  I agree that usage_count maintenance is unlikely to become a
  bottleneck unless one or both of those is fixed first (and maybe not even
  then)

 usage_count manipulation is not a bottleneck but that is irrelevant.
 It can be affected by other page contention which can lead to priority
 inversion.  I don't believe there is any reasonable argument that
 sitting and spinning while holding the BufFreelistLock is a good idea.

 In my experience the mere fact of (unlockedly, but still) accessing all the
 buffer headers can cause noticeable slowdowns in write only/mostly workloads
 with big amounts of shmem.
 Due to the write-only nature, large numbers of buffers have similar
 usagecounts (since they are infrequently touched after the initial insertion)
 and there are no free ones around, so the search for a buffer frequently runs
 through *all* buffer headers multiple times until it has decremented all
 usagecounts to 0. Then comes a period where free buffers are found easily
 (since all usagecounts from the current sweep point onwards are zero). After
 that it starts all over.
 I have now seen that scenario multiple times :(

 Interesting -- I was thinking about that too, but it's a separate
 problem with a different trigger.  Maybe a bailout should be in there
 so that after X usage_count adjustments the sweeper summarily does an
 eviction, or maybe the max declines from 5 once per hundred buffers
 inspected or some such.

 What's the potential downside on that though? IE: what happens if this
 scheme suddenly evicts the root page on a heavily used index? You'll
 suddenly have a ton of stuff blocked waiting for that page to come back in.

That seems pretty unlikely because of (A) sheer luck of hitting that
page for the dropout (if your buffer count is N the chances of losing
it would seem to be 1/N at most) and (B) highly used pages are much more
likely to be pinned and thus immune from eviction.  But my issue with
this whole line of analysis is that I've never been able to turn it
up in simulated testing.   Probably to do it you'd need very very fast
storage.

 This is a use case that I think would benefit greatly from a background
 process that keeps pages in the free list.

 That said, I now suspect that your frightened turtle approach would be of
 higher value than bgfreelist... but I suspect we'll ultimately want both
 of them for different reasons.

Well maybe.  Performance analysis of patches like this has to be
systematic and extremely thorough, so who knows what's the best way
forward? (hint: not me).  I'm hoping that adjusting clock sweep
behavior will shave a few cycles in uncontended workloads -- this will
make it a lot easier to sell performance wise.  But I'm optimistic and
maybe we can greatly reduce allocation contention without a lot of
work.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-04-01 Thread Amit Kapila


 -Original Message-
 From: Jim Nasby [mailto:j...@nasby.net]
 Sent: Tuesday, April 02, 2013 4:43 AM
 To: Amit Kapila
 Cc: 'Ants Aasma'; 'Merlin Moncure'; 'Tom Lane'; 'Atri Sharma'; 'Greg
 Stark'; 'PostgreSQL-development'
 Subject: Re: [HACKERS] Page replacement algorithm in buffer cache
 
 On 3/23/13 4:43 AM, Amit Kapila wrote:
  I have tried one of the ideas: adding the buffers the background writer
  finds reusable to the freelist.
  http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
  This can reduce the clock sweep as it can find buffers from the freelist.
 
 That's a nice potential efficiency gain, but it's not the same as
 having a separate bg process charged with keeping pages on the
 freelist. I believe a separate process would be useful in a wider
 variety of workloads, because it's not dependent on stumbling across 0
 count blocks; it would actively work to produce zero count blocks
 when none existed and then free-list them.

If the bg process works to produce zero-count blocks, wouldn't it give
priority to reducing usage counts even when there is no demand? Or would you
like the work in the bg process to be driven by some statistics on buffer
usage?

  It shows performance improvement for read loads when data can be
 contained
  in shared buffers,
  but when the data becomes large and I/O is involved, it shows some
 dip as
  well.
 
 Do you remember off-hand why it slowed down with I/O so I don't have to
 read the whole thread? :) 

You can go through the mail below to read some of my observations regarding
the performance improvement:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4C@szxeml509-mbx

For the case when I/O is involved, you can check the readings in the mail
below:
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BC76A4C@szxeml509-mbx


 Was it just a matter of it evicting dirty pages sooner than it would
otherwise?

The exact reason was not nailed down, but as it is an I/O-heavy test, there is
some chance that OS scheduling also played a role.
The reason I think the OS can be involved is that when we are flushing
buffers internally, the OS might also change its own usage counts based on
accesses to the buffers.
The other possibility is, as pointed out by you, early eviction.


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-31 Thread Jeff Janes
On Friday, March 22, 2013, Ants Aasma wrote:

 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  well if you do a non-locking test first you could at least avoid some
  cases (and, if you get the answer wrong, so what?) by jumping to the
  next buffer immediately.  if the non locking test comes good, only
  then do you do a hardware TAS.
 
  you could in fact go further and dispense with all locking in front of
  usage_count, on the premise that it's only advisory and not a real
  refcount.  so you only then lock if/when it's time to select a
  candidate buffer, and only then when you did a non locking test first.
   this would of course require some amusing adjustments to various
   logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning.


That is one of multiple issues.  Contention on the BufFreelistLock is
another one.  I agree that usage_count maintenance is unlikely to become a
bottleneck unless one or both of those is fixed first (and maybe not even
then)

...



 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers.


I do not think that this is the case.  Neither of the SELECT-only
contention points (pinning/unpinning of index root blocks when all data is
in shared_buffers, and BufFreelistLock when all data is not in
shared_buffers) is made worse by increasing shared_buffers, as far as I have
seen.  They do scale badly with the number of concurrent processes, though.

The reports of write-heavy workloads not scaling well with shared_buffers
do not seem to be driven by the buffer management algorithm, or at least
not the freelist part of it.  They mostly seem to center on the kernel and
the IO controllers.

 Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Bruce Momjian
On Fri, Mar 22, 2013 at 06:06:18PM +, Greg Stark wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  And we definitely looked at ARC
 
 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).

The problem was cache line overhead between CPUs to manage the ARC
queues.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Bruce Momjian
On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
  I think there is some very low hanging optimization fruit in the clock
  sweep loop.   first and foremost, I see no good reason why when
  scanning pages we have to spin and wait on a buffer in order to
  pedantically adjust usage_count.  some simple refactoring there could
  set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could construed as a 'trylock' variant of a spinlock
  and might help out with cases where an especially hot buffer is
  locking up the sweep.  This is exploiting the fact that from
  StrategyGetBuffer we don't need a *particular* buffer, just *a*
  buffer.
 
 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.
 
 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

Could we view the usage count, and if it is 5, and if there is someone
holding the lock, we just ignore the buffer and move on to the next
buffer?  Seems that could be done with no locking.
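
For what it's worth, that kind of conditional acquire is essentially
test-and-test-and-set; a generic sketch using a GCC builtin (not our s_lock
machinery) would be:

    #include <stdbool.h>

    static inline bool
    ttas_trylock(volatile int *lock)
    {
        /* plain read first: a busy lock is skipped without generating
         * cache line contention */
        if (*lock != 0)
            return false;

        /* only an apparently free lock pays for the real atomic exchange */
        return __sync_lock_test_and_set(lock, 1) == 0;
    }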

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-26 Thread Merlin Moncure
On Tue, Mar 26, 2013 at 11:40 AM, Bruce Momjian br...@momjian.us wrote:
 On Fri, Mar 22, 2013 at 04:16:18PM -0400, Tom Lane wrote:
 Merlin Moncure mmonc...@gmail.com writes:
  I think there is some very low hanging optimization fruit in the clock
  sweep loop.   first and foremost, I see no good reason why when
  scanning pages we have to spin and wait on a buffer in order to
  pedantically adjust usage_count.  some simple refactoring there could
  set it up so that a simple TAS (or even a TTAS with the first test in
   front of the cache line lock as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could construed as a 'trylock' variant of a spinlock
  and might help out with cases where an especially hot buffer is
  locking up the sweep.  This is exploiting the fact that from
  StrategyGetBuffer we don't need a *particular* buffer, just *a*
  buffer.

 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.

 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

 Could we view the usage count, and if it is 5, and if there is someone
 holding the lock, we just ignore the buffer and move on to the next
 buffer?  Seems that could be done with no locking.

The idea is that if someone is "holding" the lock, to completely ignore
the buffer regardless of usage.  Quotes there because we test the lock
without taking the cache line lock.  Now if the buffer is apparently
unlocked but turns out to be locked once you *do* acquire the cache line
lock in anticipation of refcounting, again immediately bail and go to the
next buffer.

I see no reason whatsoever to have the buffer allocator spin and wait on a
blocked buffer.  This is like jumping onto a merry-go-round being spun
by sadistic teenagers.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
 Perhaps this isn't the help you were looking for, but I spent a long time
 looking into this a few years ago.  Then I stopped and decided to work on
 other things.  I would recommend you do so too.

Agreed. It seems that my concerns were not valid, and since you have
already done some testing here, it further closes the matter.


 4) If most, but not quite all, of the highly-used data fits shared_buffers
 and shared_buffers takes most of RAM (or at least, most of RAM not already
 needed for other things like work_mem and executables), then the replacement
 policy matters a lot.  But different policies suit different work-loads, and
 there is little reason to think we can come up with a way to choose between
 them.  (Also, in these conditions, performance is very chaotic.  You can run
 the same algorithm for a long time, and it can suddenly switch from good to
 bad or the other way around, and then stay in that new mode for a long
 time).  Also, even if you come up with a good algorithm, if you make the
 data set 20% smaller or 20% larger, it is no longer a good algorithm.

Does that mean that an ideal, high performance postgres setup should
*never* set shared_buffers to a large percentage of the system's
RAM? If so, have we ever encountered issues on low-spec systems?


 5) Having buffers enter with usage_count=0 rather than 1 would probably be
 slightly better most of the time under conditions described in 4, but there
 is no way get enough evidence of this over enough conditions to justify
 making a change.  And besides, how often do people run with shared_buffers
 being most of RAM, and the hot data just barely not fitting in it?

Agreed.

 1) If all data fits in RAM but not shared_buffers, and you have a very large
 number of CPUs and a read-only or read-mostly workload, then BufFreelistLock
 can be a major bottleneck.  (But, on an Amazon high-CPU instance, I did not
 see this very much.  I suspect the degree of problem depends a lot on
 whether you have a lot of sockets with a few CPUs each, versus one chip with
 many CPUs).  This is very easy to come up with model cases for, pgbench -S
 -c8 -j8, for example, can often show it.

I will try that, thanks.

 2) A major reason that people run with shared_buffers much lower than RAM is
 that performance seems to suffer with shared_buffers > 8GB under write-heavy
 workloads, even with spread-out checkpoints.  This is frequently reported as
 a real world problem, but as far as I know has never been reduced to a
 simple reproducible test case. (Although there was a recent thread, maybe
 "High CPU usage / load average after upgrading to Ubuntu 12.04", that I
 thought might be relevant to this.  I haven't had the time to seriously
 study the thread, or the hardware to investigate it myself)


This seems interesting. Do we have any indications as to what the
problems could be?

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma

 I'll have to take a look.  Removing *all spinning* from page
 allocation though feels like it might be worthwhile to test (got to
 give some bonus points for being a very local change and simple to
 implement).  I wonder if with more shared buffers you tend to sweep
 more buffers per allocation.  (IIRC Jeff J was skeptical of that).

Replacing the spinning with a TAS on the buffer header sounds like a good idea.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Ants Aasma
On Sun, Mar 24, 2013 at 9:29 AM, Atri Sharma atri.j...@gmail.com wrote:
 I'll have to take a look.  Removing *all spinning* from page
 allocation though feels like it might be worthwhile to test (got to
 give some bonus points for being a very local change and simple to
 implement).  I wonder if with more shared buffers you tend to sweep
 more buffers per allocation.  (IIRC Jeff J was skeptical of that).

 Replacing the spinning with a TAS on the buffer header sounds like a good idea.

Well, TAS is exactly what spinlocks are spinning on. Plain old
unlocked read-modify-write should be good enough for the clock sweep usage
count update, with the header lock taken only when we decide to try and
evict something. The data raciness will mean a higher or lower than
normal usage count when an increment and a decrement race each
other. The race is only likely for buffers with high contention. If we
use the value calculated locally to decide on eviction, the highly
used buffers where this is likely will get at least one clock sweep
cycle of grace time. If they are indeed highly used it's likely that
someone will manage to bump the usage_count in the meanwhile. If they
are not hot, a rare speedup or delay in eviction won't matter much.
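
As a concrete sketch, the clock sweep side would then be just this
(TryEvict is a hypothetical stand-in for taking the header spinlock and
re-checking for real):

    int count = buf->usage_count;       /* plain unlocked read */

    if (count > 0)
        buf->usage_count = count - 1;   /* store derived from the local copy,
                                         * so the value can never go negative */
    else
        TryEvict(buf);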

Getting rid of memory barriers associated with locking in the clock
sweep will pipeline the loads and stores and so should bring on a good
performance increase for scanning large swathes of buffer headers.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Greg Smith

On 3/22/13 8:45 AM, Ants Aasma wrote:

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general.


Right.  The current page replacement method works as expected.  Many 
frequently accessed pages accumulate a usage count of 5 before the clock 
sweep hits them.  Pages that are accessed once and not again before the 
clock sweep are evicted.  There are several theoretically better ways to 
approach this.  Anyone who hasn't already been working on this for a few 
years is very unlikely to come up with a brand new idea, one that hasn't 
already been tested in the academic research.


But the real blocker here isn't ideas, it's creating benchmark workloads 
to validate any change.  Right now I see the most promising work that 
could lead toward the performance farm idea as all of the Jenkins 
based testing that's been going on recently.  Craig Ringer has been using 
that for 2ndQuadrant work here, Peter Eisentraut has been working with 
it: 
http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html 
and the PostGIS project uses it too.  There's some good momentum brewing 
there.


When we have regular performance testing with a mix of workloads--I have 
about 10 in mind to start--at that point we could start testing 
performance changes to the buffer replacement.  Until then this whole 
area is hard to touch usefully.  You have to assume that any tuning you 
do for one type of workload might accidentally slow another.  Starting 
with a lot of baseline workloads is the only way to move usefully 
forward when facing that problem.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
On Sun, Mar 24, 2013 at 6:11 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 3/22/13 8:45 AM, Ants Aasma wrote:

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general.


 Right.  The current page replacement method works as expected.  Many
 frequently accessed pages accumulate a usage count of 5 before the clock
 sweep hits them.  Pages that are accessed once and not again before the
 clock sweep are evicted.  There are several theoretically better ways to
 approach this.  Anyone who hasn't already been working on this for a few
 years is very unlikely to come up with a brand new idea, one that hasn't
 already been tested in the academic research.

 But the real blocker here isn't ideas, it's creating benchmark workloads to
 validate any change.  Right now I see the most promising work that could
 lead toward the performance farm idea as all of the Jenkins based testing
 that's been going on recently.  Craig Ringer has using that for 2ndQuadrant
 work here, Peter Eisentraut has been working with it:
 http://petereisentraut.blogspot.com/2013/01/postgresql-and-jenkins.html and
 the PostGIS project uses it too.  There's some good momentum brewing there.

 When we have regular performance testing with a mix of workloads--I have
 about 10 in mind to start--at that point we could start testing
 performance changes to the buffer replacement.  Until then this whole area
 is hard to touch usefully.  You have to assume that any tuning you do for
 one type of workload might accidentally slow another.  Starting with a lot
 of baseline workloads is the only way to move usefully forward when facing
 that problem.

Wow! A Jenkins based performance farm sounds like just what we want.

Can we add tests for lock contention (as discussed above), so that we
can move to some actual diagnostics?

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Atri Sharma
 If we
 use the value calculated locally to decide on eviction, the highly
 used buffers where this is likely will get at least one clock sweep
 cycle of grace time. If they are indeed highly used it's likely that
 someone will manage to bump the usage_count in the meanwhile. If they
 are not hot, a rare speedup or delay in eviction won't matter much.

Yeah, a buffer page getting an extra clock sweep of life in exchange for a
potential performance improvement isn't much of a cost to pay.

So, essentially, we decide locally which page to evict, then try to
get a lock on the header only when we want to evict the victim page?
Sounds like the contention for the header should go down considerably.

Unlocked incrementing/decrementing of USAGE_COUNT may leave the
values of USAGE_COUNT off from their true values by 1 or 2 counts on
highly contended buffer pages, which shouldn't matter, precisely
because they are highly contended. I agree, it is likely that some
process will increase the usage_count anyway.

 Getting rid of memory barriers associated with locking in the clock
 sweep will pipeline the loads and stores and so should bring on a good
 performance increase for scanning large swathes of buffer headers.

This does sound interesting. If the Jenkins based performance farm
gets ready, we can do some tests and see the kind of results we get.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-24 Thread Ants Aasma
On Sun, Mar 24, 2013 at 7:03 PM, Atri Sharma atri.j...@gmail.com wrote:
 So, essentially, we decide locally which page to evict, then try to
 get a lock on the header only when we want to evict the victim page?
 Sounds like the contention for the header should go down considerably.

Not exactly locally, the idea is still to use the shared buffer header
structures. We just accept any errors in usage_count coming from data
races. This won't solve buffer header contention as modifying
usage_count still brings the cacheline in an exclusive state. It does
help with reducing the time BufFreelistLock is held and it
eliminates spinning on the clocksweep side so only queries that access
the contended buffer are hurt by spinning.

 Unlocked incrementing/decrementing of USAGE_COUNT may lead to the
 values of USAGE_COUNT differing from the actual value they should be
 having by 1 or 2 counts in case of high contention buffer pages, which
 shouldn't matter, as they are in high contention. I agree, it is
 likely that a process(s) will increase the usage_count anyways.

There is no point in an unlocked increment as this is done in
conjunction with a buffer pin that requires the header spinlock
anyway.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jim Nasby

On 3/22/13 7:27 PM, Ants Aasma wrote:

On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).


Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be in my opinion convert to a futex based spinlock so when
there is contention it doesn't completely kill performance and then
try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf


Partitioned clock sweep strikes me as a bad idea... you could certainly get 
unlucky and end up with a lot of hot stuff in one partition.

Another idea that's been brought up in the past is to have something in the 
background keep a minimum number of buffers on the free list. That's how the 
OS VM systems I'm familiar with work, so there's precedent for it.

I recall there were at least some theoretical concerns about this, but I don't 
remember if anyone actually tested the idea.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Amit Kapila
On Saturday, March 23, 2013 9:34 AM Jim Nasby wrote:
 On 3/22/13 7:27 PM, Ants Aasma wrote:
  On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
 
  One other interesting idea I have seen is closeable scalable nonzero
  indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
  use a tree structure to dynamically stripe access to the shared lock
  counter when contention is detected. Downside is that considerable
  amount of shared memory is needed so there needs to be some way to
  limit the resource usage. This is actually somewhat isomorphic to the
  nailing idea.
 
  The issue with the current buffer management algorithm is that it
  seems to scale badly with increasing shared_buffers. I think the
  improvements should concentrate on finding out what is the problem
  there and figuring out how to fix it. A simple idea to test would be
  to just partition shared buffers along with the whole clock sweep
  machinery into smaller ones, like the buffer mapping hash tables
  already are. This should at the very least reduce contention for the
  clock sweep even if it doesn't reduce work done per page miss.
 
  [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
 
 Partitioned clock sweep strikes me as a bad idea... you could certainly
 get unlucky and end up with a lot of hot stuff in one partition.
 
 Another idea that's been brought up in the past is to have something in
 the background keep a minimum number of buffers on the free list.
 That's how OS VM systems I'm familiar with work, so there's precedent
 for it.
 
 I recall there were at least some theoretical concerns about this, but
 I don't remember if anyone actually tested the idea.

I have tried one of the ideas: adding the buffers the background writer finds
reusable to the freelist.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852FF97@szxeml509-mbs
This can reduce the clock sweep work, as backends can find buffers on the freelist.

It shows a performance improvement for read loads when the data can be contained
in shared buffers,
but when the data becomes large and I/O is involved, it shows some dip as
well.
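
The gist of it, roughly (a simplified sketch of the idea, not the actual
patch, which differs in the details):

    /* in the bgwriter's scan: push clean, unpinned, usage_count-zero
     * buffers onto the freelist instead of leaving them for a backend's
     * clock sweep to find */
    bool        push = false;

    LockBufHdr(buf);
    if (buf->refcount == 0 && buf->usage_count == 0 &&
        !(buf->flags & BM_DIRTY))
        push = true;
    UnlockBufHdr(buf);

    if (push)
        StrategyFreeBuffer(buf);    /* existing freelist.c entry point */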


With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:29 AM, Atri Sharma atri.j...@gmail.com wrote:
 One way to distribute memory contention in case of spinlocks could be
 to utilize the fundamentals of NUMA architecture. Specifically, we can
 let the contending backends spin on local flags instead on the buffer
 header flags directly. As access to local cache lines is much cheaper
 and faster than memory locations which are far away in NUMA, we could
 potentially reduce the memory overhead for a specific line and reduce
 the overall overheads as well.

This is not even specific to NUMA architectures (which by now means all
multiprocessor machines); even multicore machines have overheads for
bouncing cache lines. The locks don't even have to be local, it's good
enough to just have better probability of each backend contending
hitting a different lock, if we take care of not having the locks
share cache lines. IMO that's the whole point of striping locks, the
critical section is usually cheaper than the cost of getting the cache
line in an exclusive state.
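
To illustrate the cacheline point: striped locks only help if the stripes
are padded onto separate lines, something like this (a sketch; the 64-byte
line size is an assumption and varies by platform):

    #define CACHELINE_SIZE 64       /* assumed; platform dependent */

    typedef struct
    {
        slock_t     lock;
        char        pad[CACHELINE_SIZE - sizeof(slock_t)];
    } PaddedSpinLock;

    /* without the pad, several stripes share one cache line and every
     * TAS still bounces that same line between sockets */
    static PaddedSpinLock stripes[64];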

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Ants Aasma
On Sat, Mar 23, 2013 at 6:04 AM, Jim Nasby j...@nasby.net wrote:
 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

Surely that is not worse than having everything in a single partition.
Given a decent partitioning function it's very highly unlikely to have
more than a few of the hottest buffers end up in a single partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

Yes, having bgwriter do the actual cleaning up seems like a good idea.
The whole bgwriter infrastructure will need some serious tuning. There
are many things that could be shifted to background if we knew it
could keep up, like hint bit setting on dirty buffers being flushed
out. But again, we have the issue of having good tests to see where
the changes hurt.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Atri Sharma


 Partitioned clock sweep strikes me as a bad idea... you could certainly get
 unlucky and end up with a lot of hot stuff in one partition.

 Another idea that's been brought up in the past is to have something in the
 background keep a minimum number of buffers on the free list. That's how OS
 VM systems I'm familiar with work, so there's precedent for it.

 I recall there were at least some theoretical concerns about this, but I
 don't remember if anyone actually tested the idea.

 One way to handle this could be to have dynamic membership of pages
in the partitions. Based on activity for a page, it could be moved to
another partition. In this manner, we *could* distribute the hot and
not so hot buffer pages and hence it could help.

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Thu, Mar 21, 2013 at 9:51 PM, Atri Sharma atri.j...@gmail.com wrote:

 Hello all,

 Sorry if this is a naive question.

 I was going through Greg Smith's slides on buffer
 cache(
 http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.

 Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.


It is replaced when the usage_count is already found to be zero, not when
it is made zero.
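
To spell that out, the loop in StrategyGetBuffer is roughly this (a
simplified sketch; the freelist check, the bgwriter latch handling, and
the "no unpinned buffers available" exit are elided, and the real code
returns with the buffer header spinlock still held):

    for (;;)
    {
        volatile BufferDesc *buf;

        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LockBufHdr(buf);
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;     /* decrement, keep scanning */
            else
                return buf;             /* found at zero: our victim */
        }
        UnlockBufHdr(buf);
    }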


 I feel that this could lead to a bias towards replacement of
 relatively younger pages in the  cache over older pages. An entry
 which has just entered the cache with USAGE_COUNT=1 could be replaced
 soon, but it may be needed frequently in the near future,


Well, it may be needed.  But then again, it may not be needed.  And that
old page, that also may be needed frequently in the future (which is far
more likely than a new page--after all an old page likely got old for a
reason).  The best evidence that it will be needed again is that it
actually has been needed again.

I'm more convinced in the other direction, new pages should enter with 0
rather than with 1.  I think that the argument that a new buffer needs to
be given more of an opportunity to get used again is mostly bogus.  You
cannot bully the shared_buffers into being larger than it is.  If all the
incoming buffers get more opportunity, that just means the buffer-clock
ticks twice as fast, and really none of them has more opportunity when you
measure that opportunity against an outside standard (wall time, or
work-load accomplished).  All of our children cannot be above average.

Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma a...@cybertec.at wrote:
 On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
 logical checks (usage_count <= 0, heh).

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds)

yeah. There's something to be said for having an upper bound on the
length of time to get a page out (except in the special case when most
of them are pinned).  Right now, any page contention on a buffer
header for any reason can shut down buffer allocation, and that's just
not good.  It's obviously not very likely to happen but I think it can
and does happen.  The more I think about it, the more I think it's a bad
idea to spin during buffer allocation for any reason, ever.

 However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

Yup -- futexes are another way to go.  They are Linux-specific though.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

Yeah this is a more general (albeit more complicated) solution and
would likely be fantastic.  Is it safe to assume that refcounting is
the only likely cause of contention?

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.

 [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look.  Removing *all spinning* from page
allocation though feels like it might be worthwhile to test (got to
give some bonus points for being a very local change and simple to
implement).  I wonder if with more shared buffers you tend to sweep
more buffers per allocation.  (IIRC Jeff J was skeptical of that).

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I'm more convinced in the other direction, new pages should enter with 0
 rather than with 1.  I think that the argument that a new buffer needs to
 be given more of an opportunity to get used again is mostly bogus.

IIRC, the argument for starting at 1 not 0 is that otherwise a new page
might have an infinitesimally small lifespan, if the clock sweep should
reach it just after it gets entered into the buffers.  By starting at
1, the uncertainty in a new page's lifespan runs from 1 to 2 sweep times
not 0 to 1 sweep time.

I think though that this argument only holds water if the buffer didn't
get found via the clock sweep to start with --- otherwise, it ought to
have just about one clock sweep of time before the sweep comes back to
it.  It does apply to buffers coming off the freelist, though.

Thus, if we were to get rid of the freelist then maybe we could change
the starting usage_count ... but whether that's a good idea in itself
is pretty uncertain.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-23 Thread Jeff Janes
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma atri.j...@gmail.com wrote:




 Not yet, I figured this might be a problem and am designing test cases
 for the same. I would be glad for some help there please.


Perhaps this isn't the help you were looking for, but I spent a long time
looking into this a few years ago.  Then I stopped and decided to work on
other things.  I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows
that there is a problem, then why should I believe that there actually is a
problem?  If you take a well-known problem (like, say, bad performance at
shared_buffers > 8GB (or even lower, on Windows)) and create an artificial
test case to exercise and investigate that, that is one thing.  But why
invent pathological test cases with no known correspondence to reality?
 There are plenty of real problems to work on, and some of them are just as
intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy
doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I
believe), then what mostly matters is your OS's replacement policy, not
pgsql's.  Not much a pgsql hacker can do about this, other than turn into a
kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd
replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits shared_buffers
and shared_buffers takes most of RAM (or at least, most of RAM not already
needed for other things like work_mem and executables), then the
replacement policy matters a lot.  But different policies suit different
work-loads, and there is little reason to think we can come up with a way
to choose between them.  (Also, in these conditions, performance is very
chaotic.  You can run the same algorithm for a long time, and it can
suddenly switch from good to bad or the other way around, and then stay in
that new mode for a long time).  Also, even if you come up with a good
algorithm, if you make the data set 20% smaller or 20% larger, it is no
longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 would probably be
slightly better most of the time under conditions described in 4, but there
is no way get enough evidence of this over enough conditions to justify
making a change.  And besides, how often do people run with shared_buffers
being most of RAM, and the hot data just barely not fitting in it?


If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very
large number of CPUs and a read-only or read-mostly workload,
then BufFreelistLock can be a major bottleneck.  (But, on an Amazon
high-CPU instance, I did not see this very much.  I suspect the degree of
problem depends a lot on whether you have a lot of sockets with a few CPUs
each, versus one chip with many CPUs).  This is very easy to come up with
model cases for, pgbench -S -c8 -j8, for example, can often show it.

2) A major reason that people run with shared_buffers much lower than RAM
is that performance seems to suffer with shared_buffers > 8GB under
write-heavy workloads, even with spread-out checkpoints.  This is
frequently reported as a real world problem, but as far as I know has never
been reduced to a simple reproducible test case. (Although there was a
recent thread, maybe High CPU usage / load average after upgrading to
Ubuntu 12.04, that I thought might be relevant to this.  I haven't had the
time to seriously study the thread, or the hardware to investigate it
myself)

Cheers,

Jeff


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma


Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:

 On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
 Hello all,
 
 Sorry if this is a naive question.
 
 I was going through Greg Smith's slides on buffer
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
 he.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.
 
  Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.
 
 Yes, it is replaced, but in the next clock sweep pass, not immediately after
 being made 0.
 So until the next pass, if nobody accesses the buffer and all other
 buffers have a higher count, it can be replaced.
 Also, for the buffer it has returned, whose usage count becomes 1, the
 sweep will come around to reduce that usage count only in the next pass.
 So in all, I think it needs 2 passes for a freshly returned buffer to be
 re-used in case no one uses it again.
 
 With Regards,
 Amit Kapila.
 

Hmm, so in the second pass, it gets replaced, right?

I think that if the initialization of USAGE_COUNT starts at the maximum allowed 
value instead of one, we can have a better solution to this problem.

Another, more complex solution could be to introduce an aging factor as well.

Regards,

Atri

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:
 
 
 Sent from my iPad
 
 On 22-Mar-2013, at 11:28, Amit Kapila amit.kap...@huawei.com wrote:
 
  On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
  Hello all,
 
  Sorry if this is a naive question.
 
  I was going through Greg Smith's slides on buffer
 
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
  he.pdf).
  When going through the page replacement algorithm that we use i.e.
  clocksweep algorithm, I felt a potential problem in our current
  system.
 
   Specifically, when a new entry is allocated in the buffer, its
  USAGE_COUNT is set to 1. On each sweep of the algorithm, the
  USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
  zero is replaced.
 
  Yes, it is replaced, but in the next clock sweep pass, not immediately
  after being made 0.
  So until the next pass, if nobody accesses the buffer and all other
  buffers have a higher count, it can be replaced.
  Also, for the buffer it has returned, whose usage count becomes 1, the
  sweep will come around to reduce that usage count only in the next pass.
  So in all, I think it needs 2 passes for a freshly returned buffer to be
  re-used in case no one uses it again.
 
  With Regards,
  Amit Kapila.
 
 
  Hmm, so in the second pass, it gets replaced, right?
  Yes.


 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

So what is your idea? If you start at maximum, what will we do for further
accesses to it?
Why do you want to give more priority to a just-loaded page?


 Another,more complex solution could be to introduce an ageing factor as
 well.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 I think that if the initialization of USAGE_COUNT starts at the maximum
 allowed value instead of one, we can have a better solution to this
 problem.

 So what is your idea, if you start at maximum, what we will do for further
 accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to the maximum value is not at all good. I was
just thinking off the top of my head.

 Why do you want to give more priority to just loaded page?

I just want it to have more chances to stay, rather than being
replaced pretty early. This is because, as I said earlier, a new page
could be in high demand in the near future, and early eviction would lead
to repeated replacement and re-reading of the page, and hence cause overhead.




 Another, more complex solution could be to introduce an aging factor

This is the one I think would work out best: add an age factor reflecting
the time an entry has spent in the cache, along with its
usage count.

So, what I am proposing here is to add another factor in the
clocksweep algorithm when it selects victim pages for replacement.
Specifically, the selection of victim pages should be done with the
usage_count AND the time spent by the entry in the cache. This would
give priority to pages with high accesses and not ignore relatively
young pages as well. If a page is not accessed for a long time after
it was allocated, it would be the ideal victim for replacement both in
terms of USAGE_COUNT as well as age.
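
To make that concrete, a purely hypothetical victim score could look like
this (none of these inputs exist in the buffer headers today; all names
are invented for illustration):

    /* lower score == better eviction victim */
    static int
    victim_score(uint16 usage_count,
                 uint32 sweeps_since_load,
                 uint32 sweeps_since_access)
    {
        if (sweeps_since_access >= 2)
            return 0;                   /* idle for a while: ideal victim */
        if (sweeps_since_load <= 1)
            return usage_count + 1;     /* grace period for a young page */
        return usage_count;             /* otherwise usage_count decides */
    }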

Regards,

Atri




--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:
 
  I think that if the initialization of USAGE_COUNT starts at the
 maximum
  allowed value instead of one, we can have a better solution to this
  problem.
 
  So what is your idea, if you start at maximum, what we will do for
 further
  accesses to it?
 
  I haven't chalked out a detailed plan yet, but I think the idea of
 initializing USAGE_COUNT to maximum value is not at all good. I was
 just thinking off the top of my head.
 
  Why do you want to give more priority to just loaded page?
 
 I just want it to have more chances to stay, rather than being
  replaced pretty early. This is because, as I said earlier, a new page
  could be in high demand in the near future, and early eviction would lead
  to repeated replacement and re-reading of the page, and hence cause overhead.
 
 
 
 
  Another, more complex solution could be to introduce an aging factor
 
 This is the one I think would work out best: add an age factor reflecting
 the time an entry has spent in the cache, along with its
 usage count.
 
 So, what I am proposing here is to add another factor in the
 clocksweep algorithm when it selects victim pages for replacement.
 Specifically, the selection of victim pages should be done with the
 usage_count AND the time spent by the entry in the cache. This would
 give priority to pages with high accesses and not ignore relatively
 young pages as well. If a page is not accessed for a long time after
 it was allocated, it would be the ideal victim for replacement both in
 terms of USAGE_COUNT as well as age.

What would you do if the only young page has usage count zero during the
second sweep?
I don't think introducing another factor along with usage count would help
much.
Have you encountered any workload where relatively young pages are getting
victimized and that is causing performance issues?

With Regards,
Amit Kapila.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 What would you do if the only young page has usage count zero during the
 second sweep?

Umm... The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?


 I don't think introducing another factor along with usage count would help
 much.

What other approach can we take here?


 Have you encountered any workload where relatively young pages are getting
 victimized and that is causing performance issues?

Not yet, I figured this might be a problem and am designing test cases
for the same. I would be glad for some help there please.

Regards,

Atri




--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Amit Kapila
On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
  What would you do if the only young page has usage count zero during
 second
  sweep.
 
  Umm... The same approach we take when there is no page with usage
 count zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more used page.
I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:
 
  What would you do if the only young page has usage count zero during
 second
  sweep.

  Umm... The same approach we take when there is no page with usage
 count zero in a sweep in the current algorithm?

 It would give more priority to a young page as compared to a more used page.
 I don't know if that would be the correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance of being replaced initially.
With time, usage_count shall become the deciding factor in victim
selection.

Regards,

Atri



--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Mar 22, 2013 12:46 PM, Atri Sharma atri.j...@gmail.com wrote:

 This is the one I think would work out best, add an age factor as to
 the time duration which an entry has spent in the cache along with its
 usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate least frequently used instead of
least recently used. MySQL uses it for their buffer replacement
policy. There is also a clock sweep based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to regular clock sweep or even
ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm
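
For what it's worth, the captured records wouldn't need to carry much;
hypothetically (invented shape, no such code exists):

    /* one record per pin/unpin, appended to a per-backend file and
     * replayed later against candidate replacement policies */
    typedef struct
    {
        BufferTag   tag;    /* which page was touched */
        char        op;     /* 'P' = pin, 'U' = unpin */
        uint32      seq;    /* per-backend sequence number */
    } PinTraceRecord;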

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general. My idea was to create
 a patch to capture page pinning traffic from PostgreSQL (maybe stream
 out into a per backend file), run it with some production workloads
 and use that to generate testing workloads for the cache replacement
 policies. Haven't gotten round to actually doing that though.

 [1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm


Thanks for the link. I think LIRS can indeed be helpful in our case.

We should indeed build some test cases for testing this theory. I am
all for capturing page replacement and usage data and analyzing it.

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 You might want to check out the LIRS cache replacement algorithm [1].
 That algorithm tries to estimate least frequently used instead of
 least recently used. Mysql uses it for their buffer replacement
 policy. There is also a clock sweep based approximation called
 CLOCK-Pro. Papers describing and evaluating both are available on the
 net. The evaluations in the papers showed significantly better
 performance for both of those compared to regular clock sweep or even
 ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach.  And we definitely looked at ARC.  It might be
worth checking the archives from back then to see what's already been
considered.

 However, I think the main issue isn't finding new algorithms that are
 better in some specific circumstances. The hard part is figuring out
 whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Greg Stark
On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma
On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

Regards,

Atri

--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 And we definitely looked at ARC

 We didn't just look at it. At least one release used it. Then patent
 issues were raised (and I think the implementation had some contention
 problems).


 --
 greg

 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly?  the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm.  I haven't seen any
reason to think we have a problem with the quality of the replacement
choices.  The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma atri.j...@gmail.com wrote:
 What is the general thinking? Is it time to start testing again and
 thinking about improvements to the current algorithm?

 well, what problem are you trying to solve exactly?  the main problems
 I see today are not so much in terms of page replacement but spinlock
 and lwlock contention.

 Even back when we last hacked on that algorithm, the concerns were not
 so much about which pages it replaced as how much overhead and
 contention was created by the management algorithm.  I haven't seen any
 reason to think we have a problem with the quality of the replacement
 choices.  The proposal to increase the initial usage count would
 definitely lead to more overhead/contention, though, because it would
 result in having to circle around all the buffers more times (on
 average) to get a free buffer.


yup...absolutely.  I have a hunch that the occasional gripes we see
about server stalls under high load with read only (or mostly read
only) loads are coming from spinlock contention under the lwlock
hitting a critical point and effectively shutting the server down
until, by chance, the backend with the lwlock gets lucky and lands the
spinlock.

I think there is some very low hanging optimization fruit in the clock
sweep loop.   first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count.  some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all.   This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep.  This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.
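
In sketch form, that trylock variant might look about like this (using
the existing s_lock.h primitives and the current BufferDesc field names;
illustrative only, not a patch):

    /* inside the sweep loop; buf is the current candidate */
    if (buf->refcount != 0)             /* unlocked peek, possibly stale */
        continue;                       /* looks pinned: skip it */

    if (TAS(&buf->buf_hdr_lock))        /* one test-and-set, never spin */
        continue;                       /* header busy: just move on */

    /* header lock held: inspect/decrement usage_count as today */
    ...
    S_UNLOCK(&buf->buf_hdr_lock);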

I also wonder if we shouldn't (perhaps in addition to the above)
resuscitate Jeff Janes's idea to get rid of the lwlock completely and
manage everything with spinlocks..

Naturally, all of this would have to be confirmed with some very robust testing.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why when
 scanning pages we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock, as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

Hm.  You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented.  So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not).  So the actual benefit is a bit less clear.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Merlin Moncure
On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 I think there is some very low hanging optimization fruit in the clock
 sweep loop.   first and foremost, I see no good reason why when
 scanning pages we have to spin and wait on a buffer in order to
 pedantically adjust usage_count.  some simple refactoring there could
 set it up so that a simple TAS (or even a TTAS with the first test in
  front of the cache line lock, as is done automatically on x86 IIRC)
  could guard the buffer and, in the event of any lock detected, simply
  move on to the next candidate without messing around with that buffer
  at all.   This could be construed as a 'trylock' variant of a spinlock
 and might help out with cases where an especially hot buffer is
 locking up the sweep.  This is exploiting the fact that from
 StrategyGetBuffer we don't need a *particular* buffer, just *a*
 buffer.

 Hm.  You could argue in fact that if there's contention for the buffer
 header, that's proof that it's busy and shouldn't have its usage count
 decremented.  So this seems okay from a logical standpoint.

 However, I'm not real sure that it's possible to do a conditional
 spinlock acquire that doesn't create just as much hardware-level
 contention as a full acquire (ie, TAS is about as bad whether it
 gets the lock or not).  So the actual benefit is a bit less clear.

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately.  if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount.  so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
 this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).
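
A sketch of that unlocked variant (read-test-store, so the count can
never run outside its bounds even when updates race; illustrative only):

    int     uc = buf->usage_count;      /* plain load, no header lock */

    if (uc > 0)
        buf->usage_count = uc - 1;      /* racy store; a lost update is fine */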

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Ants Aasma
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 well if you do a non-locking test first you could at least avoid some
 cases (and, if you get the answer wrong, so what?) by jumping to the
 next buffer immediately.  if the non locking test comes good, only
 then do you do a hardware TAS.

 you could in fact go further and dispense with all locking in front of
 usage_count, on the premise that it's only advisory and not a real
 refcount.  so you only then lock if/when it's time to select a
 candidate buffer, and only then when you did a non locking test first.
  this would of course require some amusing adjustments to various
  logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (if you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds) However, clocksweep and usage_count maintenance is not what is
causing contention because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be in my opinion convert to a futex based spinlock so when
there is contention it doesn't completely kill performance and then
try to get rid of the contention. Converting to lock-free pinning
won't help much here as what is killing us here is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.
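
The nailing idea might take a shape roughly like this (all names invented
here for illustration; this is not the actual proposal's code):

    #define MAX_NAILED_BUFFERS 64       /* invented limit */

    /* per-backend, private memory: pins taken on nailed buffers */
    static int32 MyNailedPins[MAX_NAILED_BUFFERS];

    static inline void
    PinNailed(int nail_slot)
    {
        MyNailedPins[nail_slot]++;      /* touches no shared cacheline */
    }

    /* when un-nailing, each backend folds its local count back into
     * the shared buffer header under the usual header spinlock */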

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
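
In sketch form, the partitioning experiment could be as small as this
(invented names; the partition count is arbitrary):

    #define N_SWEEP_PARTS 16            /* cf. NUM_BUFFER_PARTITIONS */

    typedef struct
    {
        slock_t lock;                   /* instead of one global lock */
        int     clock_hand;             /* hand for this partition only */
    } ClockPartition;

    static ClockPartition parts[N_SWEEP_PARTS];

    /* each partition owns a contiguous slice of the buffer pool;
     * a backend picks one (say, by pid) and sweeps only that slice */
    static int
    slice_start(int p)
    {
        return p * (NBuffers / N_SWEEP_PARTS);
    }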

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-22 Thread Atri Sharma

 Moreover, if the buffer happens to miss a decrement due to a data
 race, there's a good chance that the buffer is heavily used and
 wouldn't need to be evicted soon anyway. (if you arrange it to be a
 read-test-inc/dec-store operation then you will never go out of
 bounds) However, clocksweep and usage_count maintenance is not what is
 causing contention because that workload is distributed. The issue is
 pinning and unpinning. There we need an accurate count and there are
 some pages like index roots that get hit very heavily. Things to do
 there would be in my opinion convert to a futex based spinlock so when
 there is contention it doesn't completely kill performance and then
 try to get rid of the contention. Converting to lock-free pinning
 won't help much here as what is killing us here is the cacheline
 bouncing.

 One way to get rid of contention is the buffer nailing idea that
 Robert came up with. If some buffer gets so hot that maintaining
 refcount on the buffer header leads to contention, promote that buffer
 to a nailed status, let everyone keep their pin counts locally and
 sometime later revisit the nailing decision and if necessary convert
 pins back to the buffer header.

 One other interesting idea I have seen is closeable scalable nonzero
 indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
 use a tree structure to dynamically stripe access to the shared lock
 counter when contention is detected. Downside is that considerable
 amount of shared memory is needed so there needs to be some way to
 limit the resource usage. This is actually somewhat isomorphic to the
 nailing idea.

 The issue with the current buffer management algorithm is that it
 seems to scale badly with increasing shared_buffers. I think the
 improvements should concentrate on finding out what is the problem
 there and figuring out how to fix it. A simple idea to test would be
 to just partition shared buffers along with the whole clock sweep
 machinery into smaller ones, like the buffer mapping hash tables
 already are. This should at the very least reduce contention for the
 clock sweep even if it doesn't reduce work done per page miss.


One way to distribute memory contention in case of spinlocks could be
to utilize the fundamentals of NUMA architecture. Specifically, we can
let the contending backends spin on local flags instead of on the buffer
header flags directly. As access to local cache lines is much cheaper
and faster than memory locations which are far away in NUMA, we could
potentially reduce the memory overhead for a specific line and reduce
the overall overheads as well.

Regards.

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page replacement algorithm in buffer cache

2013-03-21 Thread Amit Kapila
On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:
 Hello all,
 
 Sorry if this is a naive question.
 
 I was going through Greg Smith's slides on buffer
 cache(http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCac
 he.pdf).
 When going through the page replacement algorithm that we use i.e.
 clocksweep algorithm, I felt a potential problem in our current
 system.
 
  Specifically, when a new entry is allocated in the buffer, its
 USAGE_COUNT is set to 1. On each sweep of the algorithm, the
 USAGE_COUNT is decremented and an entry whose  USAGE_COUNT becomes
 zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately after
being made 0.
So until the next pass, if nobody accesses the buffer and all other
buffers have a higher count, it can be replaced.
Also, for the buffer it has returned, whose usage count becomes 1, the
sweep will come around to reduce that usage count only in the next pass.
So in all, I think it needs 2 passes for a freshly returned buffer to be
re-used in case no one uses it again.
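
A worked trace of that, for a buffer nobody touches after allocation:

    /* allocation:    usage_count = 1
     * sweep pass 1:  refcount == 0, usage_count 1 -> 0, buffer skipped
     * sweep pass 2:  refcount == 0, usage_count already 0 -> evicted
     */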

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers