Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-20 Thread Heikki Linnakangas

On 03.01.2012 17:56, Simon Riggs wrote:

On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:


2. When a backend can't find a free buffer, it spins for a long time
while holding the lock. This makes the buffer strategy O(N) in its
worst case, which slows everything down. Notably, while this is
happening the bgwriter sits doing nothing at all, right at the moment
when it is most needed. The Clock algorithm is an approximation of an
LRU, so is already suboptimal as a perfect cache. Tweaking to avoid
worst case behaviour makes sense. How much to tweak? Well,...


I generally agree with this analysis, but I don't think the proposed
patch is going to solve the problem.  It may have some merit as a way
of limiting the worst case behavior.  For example, if every shared
buffer has a reference count of 5, the first buffer allocation that
misses is going to have a lot of work to do before it can actually
come up with a victim.  But I don't think it's going to provide good
scaling in general.  Even if the background writer only spins through,
on average, ten or fifteen buffers before finding one to evict, that
still means we're acquiring ten or fifteen spinlocks while holding
BufFreelistLock. I don't currently have the measurements to prove
that's too expensive, but I bet it is.


I think it's worth reducing the cost of scanning, but that has little
to do with solving the O(N) problem. I think we need both.

I've left the way open for you to redesign freelist management in many
ways. Please take the opportunity and go for it, though we must
realise that major overhauls require significantly more testing to
prove their worth. Reducing spinlocking only sounds like a good way to
proceed for this release.

If you don't have time in 9.2, then these two small patches are worth
having. The bgwriter locking patch needs less formal evidence to show
its worth. We simply don't need to have a bgwriter that just sits
waiting doing nothing.


I'd like to see some benchmarks that show a benefit from these patches, 
before committing something like this that complicates the code. These 
patches are fairly small, but nevertheless. Once we have a test case, we 
can argue whether the benefit we're seeing is worth the extra code, or 
if there's some better way to achieve it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-20 Thread Robert Haas
On Fri, Jan 20, 2012 at 9:29 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 I'd like to see some benchmarks that show a benefit from these patches,
 before committing something like this that complicates the code. These
 patches are fairly small, but nevertheless. Once we have a test case, we can
 argue whether the benefit we're seeing is worth the extra code, or if
 there's some better way to achieve it.

Agreed.  I don't really like either of these patches stylistically:
when you start sometimes locking things and sometimes not locking
things, it gets hard to reason about how the code will behave, and you
limit what you can do in the future inside the critical section
because you have to keep in mind that you don't really have full
mutual exclusion.  There are places where it may be worth making that
trade-off if we can show a large performance benefit, but I am wary of
quick hacks that may get in the way of more complete solutions we want
to implement later, especially if the performance benefit is marginal.

I'm in the process of trying to pull together benchmark numbers for
the various performance-related patches that are floating around this
CommitFest, but a big problem is that many of them need to be tested
in different ways, which are frequently not explicitly laid out in the
emails in which they are submitted.  Some of them are calculated to
improve latency, and others throughput.  Some require pgbench run with
a small scale factor, some require pgbench with a large scale factor,
and some need some completely other type of testing altogether.  Some
need short tests, others need long tests, still others require testing
with very specific parameters, and some don't really need more testing
so much as they need profiling.  Many (like the double-write patch)
need more thought about worst case scenarios, like a sequential scan
on a large, unhinted table.  Some only really need testing in one or
two scenarios, but others could affect a broad variety of workloads
and really ought to be tested in many different ways to fully
understand the behavior.   Some of them may be interdependent in
complex ways.

My current belief, based on the testing I did before Christmas and my
gut feelings about the remaining patches, is that the patches which
have the best chance of making a major difference on general workloads
are your XLOG insert scaling patch and the group commit patch.  Those
are, in fact, virtually the only ones that have had more than zero
test results posted to the list so far, and those test results are
extremely promising.  The various checkpoint-related patches also seem
somewhat promising to me, despite the lack (so far) of any test
results.  Everything else strikes me as likely to make a difference
that is either very small or only applicable in a fairly circumscribed
set of cases.  This preliminary opinion is subject to change as more
evidence comes in, of course.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-05 Thread Greg Smith

On 01/03/2012 06:22 PM, Jim Nasby wrote:

On Jan 3, 2012, at 11:15 AM, Robert Haas wrote:
   

I think that our current freelist is practically useless, because it
is almost always empty, and the cases where it's not empty (startup,
and after a table or database drop) are so narrow that we don't really
get any benefit out of having it.  However, I'm not opposed to the
idea of a freelist in general: I think that if we actually put in some
effort to keep the freelist in a non-empty state it would help a lot,
because backends would then have much less work to do at buffer
allocation time.
 

This is exactly what the FreeBSD VM system does (which is at least one of the 
places where the idea of a clock sweep for PG came from ages ago). There is a 
process that does nothing but attempt to keep X amount of memory on the free 
list, where it can immediately be grabbed by anything that needs memory. Pages 
on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. 
In fact, IIRC if a page on the freelist gets referenced again it can be pulled 
back out of the free list and put back into an active state.

The one downside I see to this is that we'd need some heuristic to determine 
how many buffers we want to keep on the free list.
   


http://wiki.postgresql.org/wiki/Todo#Background_Writer has "Consider
adding buffers the background writer finds reusable to the free list"
and "Automatically tune bgwriter_delay based on activity rather than
using a fixed interval", which both point to my 8.3 musings and other
suggestions starting at
http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php.  I
could write both those in an afternoon.  The auto-tuning stuff already
in the background writer was originally intended to tackle this issue,
but I dropped that in favor of shipping something simpler first.
There's even a prototype somewhere on an old drive here.


The first missing piece needed before this was useful was separating out 
the background writer and checkpointer processes.  Once I realized the 
checkpoints were monopolizing so much time, especially when they hit bad 
states, it was obvious the writer couldn't be relied upon for this job.  
That's much better now since Simon's commit
806a2aee3791244bf0f916729bfdb5489936e068, "Split work of bgwriter
between 2 processes: bgwriter and checkpointer", which just became
available in November to build on.


The second missing piece blocking this work in my mind was how exactly 
we're going to benchmark the result, mainly to prove it doesn't hurt 
some workloads.  I haven't fully internalized the implications of 
Robert's upthread comments, in terms of being able to construct a 
benchmark stressing both the best and worst case situation here.  That's 
really the hardest part of this whole thing, by a lot.  Recent spending 
has brought me an 8 HyperThread core laptop that can also run DTrace, so 
I expect to have better visibility into this soon too.


I think here in 2011 the idea of having a background writer process that 
could potentially occupy most of a whole core doing work so backends 
don't have to is an increasingly attractive one.  So long as that comes 
along with an auto-tuning delay, it shouldn't hurt the work toward 
lowering power management either.  Might even help really, by allowing 
larger values of bgwriter_delay than you'd want to use during busy 
periods.  I was planning to mimic the sort of fast attack/slow delay 
logic already used for the auto-tuned timing, so that you won't fall 
behind by more than one bgwriter_delay worth of activity.  Then it 
should realize a burst is here and the writer has to start moving faster.
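The fast attack/slow decay idea can be sketched roughly as below. This is purely illustrative, not code from any patch in this thread: the function name, the min/max bounds, and the doubling back-off are all assumptions.

```c
/*
 * Illustrative fast-attack/slow-decay control of bgwriter_delay (ms):
 * any allocation activity snaps the delay to its minimum immediately,
 * while idle cycles back it off gradually, so a burst arriving after an
 * idle period is behind by at most about one delay's worth of work.
 */
int
adapt_delay(int delay_ms, int recent_alloc)
{
	const int	min_delay = 10;		/* ms, assumed floor */
	const int	max_delay = 200;	/* ms, assumed ceiling */

	if (recent_alloc > 0)
		return min_delay;			/* fast attack: busy, wake often */
	delay_ms *= 2;					/* slow decay: idle, back off */
	return (delay_ms > max_delay) ? max_delay : delay_ms;
}
```

A larger delay while idle also helps the power-management goal mentioned above, since the writer wakes less often when there is nothing to do.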


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-03 Thread Robert Haas
On Mon, Jan 2, 2012 at 2:53 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Get rid of the freelist?  Once shared buffers are full, it's just about
 useless anyway.  But you'd need to think about the test cases that you
 pay attention to, as there might be scenarios where it remains useful.

 Agree freelist is mostly useless, but startup and dropping objects require
 it.

Not really.  If someone drops an object, we must invalidate all the
buffers immediately, but adding them to the free list is just an
optimization to make sure we reuse the invalidated buffers in
preference to evicting buffers that still have valid contents.  But I
suspect that in practice this is not a very important optimization,
because (1) most people probably don't drop permanent tables or
databases very often and (2) buffers that are being heavily used
should have a positive reference count, in which case the clock sweep
will pass over them and land on one of the newly-invalidated buffers
anyway.
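The pass-over behaviour described above can be sketched as a toy, single-threaded clock sweep. The struct and function names are invented for illustration, and the real code takes a spinlock per buffer header, which is exactly the cost under discussion in this thread:

```c
/* Toy buffer descriptor: just the field the sweep inspects. */
typedef struct
{
	int		usage_count;	/* bumped on access, decremented by the sweep */
} FakeBufferDesc;

/*
 * Simplified victim search: buffers with usage_count > 0 are passed
 * over (and decremented); the first zero-count buffer the hand reaches
 * -- e.g. one just invalidated by a table drop -- becomes the victim.
 */
int
clock_sweep_victim(FakeBufferDesc *bufs, int nbufs, int *hand)
{
	for (;;)
	{
		int		this = *hand;

		*hand = (*hand + 1) % nbufs;
		if (bufs[this].usage_count == 0)
			return this;				/* victim found */
		bufs[this].usage_count--;		/* one more trip around the clock */
	}
}
```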

 When it's full it's just an integer > 0 test, which is cheap, and it's on
 the same cacheline as the nextVictimBuffer anyway, so we have to fetch
 it.

 The clock sweep is where all the time goes, in its current form.

...but I agree with this.  In its current form, the clock sweep has to
acquire a spinlock for every buffer it touches.  That's really
expensive, and I think we need to either get rid of it altogether or
at least find some way to move it into the background.  Ideally, in
the common case, a backend that wants to allocate a buffer shouldn't
need to do more than acquire a spinlock, pop a buffer off some sort of
linked list of allocation targets, and release the spinlock.  Whatever
other computation is needed should get pushed out of foreground
processing.
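A minimal sketch of that fast path, assuming a hypothetical singly linked freelist guarded by a test-and-set spinlock; all names here are invented for illustration, and the slow path that refills the list is elided:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical freelist node of pre-selected allocation targets. */
typedef struct FreeBuf
{
	struct FreeBuf *next;
	int		buf_id;
} FreeBuf;

static atomic_flag freelist_lock = ATOMIC_FLAG_INIT;
static FreeBuf *freelist_head = NULL;

/*
 * Fast-path allocation: acquire a spinlock, pop the next prepared
 * victim, release.  Returns -1 when the list is empty, i.e. when the
 * foreground would otherwise have to fall back to the clock sweep.
 */
int
alloc_buffer_fast(void)
{
	int		id = -1;

	while (atomic_flag_test_and_set(&freelist_lock))
		;							/* spin */
	if (freelist_head != NULL)
	{
		id = freelist_head->buf_id;
		freelist_head = freelist_head->next;
	}
	atomic_flag_clear(&freelist_lock);
	return id;
}
```

The point of the design is that everything expensive, such as sweeping and writing out dirty pages, happens in whatever background process keeps the list populated.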

 We have 2 problems

 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so
 bgwriter has to fight with backends to find out where it should clean.
 As a result it spends lots of time waiting and insufficient time
 cleaning.

I have trouble accepting that this is really the problem we should be
solving.  There's only one background writer process, and that process
is waking up every 200ms by default.  Many people probably tune that
down quite a bit, but unless I'm quite mistaken, it should still be a
drop in the bucket next to what the backends are doing.

 2. When a backend can't find a free buffer, it spins for a long time
 while holding the lock. This makes the buffer strategy O(N) in its
 worst case, which slows everything down. Notably, while this is
 happening the bgwriter sits doing nothing at all, right at the moment
 when it is most needed. The Clock algorithm is an approximation of an
 LRU, so is already suboptimal as a perfect cache. Tweaking to avoid
 worst case behaviour makes sense. How much to tweak? Well,...

I generally agree with this analysis, but I don't think the proposed
patch is going to solve the problem.  It may have some merit as a way
of limiting the worst case behavior.  For example, if every shared
buffer has a reference count of 5, the first buffer allocation that
misses is going to have a lot of work to do before it can actually
come up with a victim.  But I don't think it's going to provide good
scaling in general.  Even if the background writer only spins through,
on average, ten or fifteen buffers before finding one to evict, that
still means we're acquiring ten or fifteen spinlocks while holding
BufFreelistLock. I don't currently have the measurements to prove
that's too expensive, but I bet it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-03 Thread Simon Riggs
On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:

 The clock sweep is where all the time goes, in its current form.

 ...but I agree with this.  In its current form, the clock sweep has to
 acquire a spinlock for every buffer it touches.  That's really
 expensive, and I think we need to either get rid of it altogether or
 at least find some way to move it into the background.  Ideally, in
 the common case, a backend that wants to allocate a buffer shouldn't
 need to do more than acquire a spinlock, pop a buffer off some sort of
 linked list of allocation targets, and release the spinlock.  Whatever
 other computation is needed should get pushed out of foreground
 processing.

So you don't think a freelist is worth having, but you want a list of
allocation targets.
What is the practical difference?


 We have 2 problems

 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so
 bgwriter has to fight with backends to find out where it should clean.
 As a result it spends lots of time waiting and insufficient time
 cleaning.

 I have trouble accepting that this is really the problem we should be
 solving.  There's only one background writer process, and that process
 is waking up every 200ms by default.  Many people probably tune that
 down quite a bit, but unless I'm quite mistaken, it should still be a
 drop in the bucket next to what the backends are doing.

That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes,
most of the contention is caused by the other backends, but the
bgwriter experiences that contention currently when it has no need to
do so.

Presumably if you did have an allocation list maintained in the
background by the bgwriter you wouldn't expect the bgwriter to wait
behind a lock for no reason when it could be doing useful work.


 2. When a backend can't find a free buffer, it spins for a long time
 while holding the lock. This makes the buffer strategy O(N) in its
 worst case, which slows everything down. Notably, while this is
 happening the bgwriter sits doing nothing at all, right at the moment
 when it is most needed. The Clock algorithm is an approximation of an
 LRU, so is already suboptimal as a perfect cache. Tweaking to avoid
 worst case behaviour makes sense. How much to tweak? Well,...

 I generally agree with this analysis, but I don't think the proposed
 patch is going to solve the problem.  It may have some merit as a way
 of limiting the worst case behavior.  For example, if every shared
 buffer has a reference count of 5, the first buffer allocation that
 misses is going to have a lot of work to do before it can actually
 come up with a victim.  But I don't think it's going to provide good
 scaling in general.  Even if the background writer only spins through,
 on average, ten or fifteen buffers before finding one to evict, that
 still means we're acquiring ten or fifteen spinlocks while holding
 BufFreelistLock. I don't currently have the measurements to prove
 that's too expensive, but I bet it is.

I think it's worth reducing the cost of scanning, but that has little
to do with solving the O(N) problem. I think we need both.

I've left the way open for you to redesign freelist management in many
ways. Please take the opportunity and go for it, though we must
realise that major overhauls require significantly more testing to
prove their worth. Reducing spinlocking only sounds like a good way to
proceed for this release.

If you don't have time in 9.2, then these two small patches are worth
having. The bgwriter locking patch needs less formal evidence to show
its worth. We simply don't need to have a bgwriter that just sits
waiting doing nothing.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-03 Thread Robert Haas
On Tue, Jan 3, 2012 at 10:56 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas robertmh...@gmail.com wrote:
 The clock sweep is where all the time goes, in its current form.

 ...but I agree with this.  In its current form, the clock sweep has to
 acquire a spinlock for every buffer it touches.  That's really
 expensive, and I think we need to either get rid of it altogether or
 at least find some way to move it into the background.  Ideally, in
 the common case, a backend that wants to allocate a buffer shouldn't
 need to do more than acquire a spinlock, pop a buffer off some sort of
 linked list of allocation targets, and release the spinlock.  Whatever
 other computation is needed should get pushed out of foreground
 processing.

 So you don't think a freelist is worth having, but you want a list of
 allocation targets.
 What is the practical difference?

I think that our current freelist is practically useless, because it
is almost always empty, and the cases where it's not empty (startup,
and after a table or database drop) are so narrow that we don't really
get any benefit out of having it.  However, I'm not opposed to the
idea of a freelist in general: I think that if we actually put in some
effort to keep the freelist in a non-empty state it would help a lot,
because backends would then have much less work to do at buffer
allocation time.

 I have trouble accepting that this is really the problem we should be
 solving.  There's only one background writer process, and that process
 is waking up every 200ms by default.  Many people probably tune that
 down quite a bit, but unless I'm quite mistaken, it should still be a
 drop in the bucket next to what the backends are doing.

 That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes,
 most of the contention is caused by the other backends, but the
 bgwriter experiences that contention currently when it has no need to
 do so.

Sure, but if that contention is a negligible fraction of the total and
doesn't cost anything measurable, then it doesn't make sense to add
code to eliminate it.  There are all sorts of places in the system
where we have contention at a level that doesn't affect performance.
Many of those could probably be fixed by adding more complicated
locking, but that would just complicate the code to no purpose.  If
there's a demonstrable performance benefit to this change then it's
worth a harder look, but my gut feeling is that there won't be.

 2. When a backend can't find a free buffer, it spins for a long time
 while holding the lock. This makes the buffer strategy O(N) in its
 worst case, which slows everything down. Notably, while this is
 happening the bgwriter sits doing nothing at all, right at the moment
 when it is most needed. The Clock algorithm is an approximation of an
 LRU, so is already suboptimal as a perfect cache. Tweaking to avoid
 worst case behaviour makes sense. How much to tweak? Well,...

 I generally agree with this analysis, but I don't think the proposed
 patch is going to solve the problem.  It may have some merit as a way
 of limiting the worst case behavior.  For example, if every shared
 buffer has a reference count of 5, the first buffer allocation that
 misses is going to have a lot of work to do before it can actually
 come up with a victim.  But I don't think it's going to provide good
 scaling in general.  Even if the background writer only spins through,
 on average, ten or fifteen buffers before finding one to evict, that
 still means we're acquiring ten or fifteen spinlocks while holding
 BufFreelistLock. I don't currently have the measurements to prove
 that's too expensive, but I bet it is.

 I think it's worth reducing the cost of scanning, but that has little
 to do with solving the O(N) problem. I think we need both.

Maybe.  I would really like a unified solution to both problems, but
I'd be willing to accept a solution for one of them if we're confident
that we've actually solved it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-03 Thread Jim Nasby
On Jan 3, 2012, at 11:15 AM, Robert Haas wrote:
 So you don't think a freelist is worth having, but you want a list of
 allocation targets.
 What is the practical difference?
 
 I think that our current freelist is practically useless, because it
 is almost always empty, and the cases where it's not empty (startup,
 and after a table or database drop) are so narrow that we don't really
 get any benefit out of having it.  However, I'm not opposed to the
 idea of a freelist in general: I think that if we actually put in some
 effort to keep the freelist in a non-empty state it would help a lot,
 because backends would then have much less work to do at buffer
 allocation time.

This is exactly what the FreeBSD VM system does (which is at least one of the 
places where the idea of a clock sweep for PG came from ages ago). There is a 
process that does nothing but attempt to keep X amount of memory on the free 
list, where it can immediately be grabbed by anything that needs memory. Pages 
on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. 
In fact, IIRC if a page on the freelist gets referenced again it can be pulled 
back out of the free list and put back into an active state.

The one downside I see to this is that we'd need some heuristic to determine 
how many buffers we want to keep on the free list.
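One hypothetical shape for such a heuristic, purely illustrative and not from any patch in this thread: derive the freelist target from the recent allocation rate, clamped to a small fraction of shared buffers so the list can never pin much memory.

```c
/*
 * Illustrative sizing heuristic: keep enough free buffers to absorb
 * about two cycles of recent demand, but never dedicate more than 5%
 * of shared buffers to the freelist.  Both constants are assumptions.
 */
int
freelist_target(int recent_alloc, int nbuffers)
{
	int		target = recent_alloc * 2;	/* cover ~2 cycles of demand */
	int		cap = nbuffers / 20;		/* at most 5% of shared buffers */

	if (target > cap)
		target = cap;
	if (target < 1)
		target = 1;						/* always keep at least one ready */
	return target;
}
```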
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-03 Thread Robert Haas
On Tue, Jan 3, 2012 at 6:22 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 3, 2012, at 11:15 AM, Robert Haas wrote:
 So you don't think a freelist is worth having, but you want a list of
 allocation targets.
 What is the practical difference?

 I think that our current freelist is practically useless, because it
 is almost always empty, and the cases where it's not empty (startup,
 and after a table or database drop) are so narrow that we don't really
 get any benefit out of having it.  However, I'm not opposed to the
 idea of a freelist in general: I think that if we actually put in some
 effort to keep the freelist in a non-empty state it would help a lot,
 because backends would then have much less work to do at buffer
 allocation time.

 This is exactly what the FreeBSD VM system does (which is at least one of the 
 places where the idea of a clock sweep for PG came from ages ago). There is a 
 process that does nothing but attempt to keep X amount of memory on the free 
 list, where it can immediately be grabbed by anything that needs memory. 
 Pages on the freelist are guaranteed to be clean (as in not dirty), but not 
 zero'd. In fact, IIRC if a page on the freelist gets referenced again it can 
 be pulled back out of the free list and put back into an active state.

 The one downside I see to this is that we'd need some heuristic to determine 
 how many buffers we want to keep on the free list.

Fortuitously, I believe the background writer already has most of the
necessary logic: it attempts to predict how many buffers are about to
be needed - I think based on a decaying average.
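A decaying average of this kind is essentially an exponential moving average; a sketch follows, with the 1/16 smoothing factor being an assumption for illustration rather than a value quoted in this thread.

```c
/*
 * Exponential moving average of per-cycle buffer allocations: each call
 * moves the estimate 1/16 of the way toward the latest observation, so
 * a single burst cannot whipsaw the prediction.
 */
double
smooth_alloc(double smoothed, int recent_alloc)
{
	return smoothed + ((double) recent_alloc - smoothed) / 16.0;
}
```

Fed with a steady allocation rate, the estimate converges to that rate; after a one-off spike it relaxes back over a few cycles instead of jumping.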

Actually, I think that logic could use some improvement, because I
believe I've heard Greg Smith comment that it's often necessary to
tune bgwriter_delay downward.  It'd be nice to make the delay adaptive
somehow, to avoid the need for manual tuning (and unnecessary wake-ups
when the system goes idle).

But possibly the existing logic is good enough for a first cut.
However, in the interest of full disclosure, I'll admit that I've done
no testing in this area at all and am talking mostly out of my
posterior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-02 Thread Simon Riggs
On Sun, Aug 14, 2011 at 7:33 PM, Robert Haas robertmh...@gmail.com wrote:

 Simon is proposing to bound the
 really bad case where you flip through the entire ring multiple times
 before you find a buffer, and that may well be worth doing.  But I
 think even scanning 100 buffers every time you need to bring something
 in is too slow.  What's indisputable is that a SELECT-only workload
 which is larger than shared_buffers can be very easily rate-limited by
 the speed at which BufFreelistLock can be taken and released.  If you
 have a better idea for solving that problem, I'm all ears...

I felt we were on the right track here for a while.

Does anyone dispute that BufFreelistLock is a problem? shared buffer
replacement is *not* O(k) and it definitely needs to be.

Does anyone have a better idea for reducing BufFreelistLock
contention? Something simple that will work for 9.2?

What steps are there between here and committing the freelist_ok.v2.patch?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-02 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Does anyone have a better idea for reducing BufFreelistLock
 contention? Something simple that will work for 9.2?

Get rid of the freelist?  Once shared buffers are full, it's just about
useless anyway.  But you'd need to think about the test cases that you
pay attention to, as there might be scenarios where it remains useful.

regards, tom lane



Re: [HACKERS] our buffer replacement strategy is kind of lame

2012-01-02 Thread Simon Riggs
On Mon, Jan 2, 2012 at 5:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Does anyone have a better idea for reducing BufFreelistLock
 contention? Something simple that will work for 9.2?

 Get rid of the freelist?  Once shared buffers are full, it's just about
 useless anyway.  But you'd need to think about the test cases that you
 pay attention to, as there might be scenarios where it remains useful.

Agree freelist is mostly useless, but startup and dropping objects require it.

When it's full it's just an integer > 0 test, which is cheap, and it's on
the same cacheline as the nextVictimBuffer anyway, so we have to fetch
it.

The clock sweep is where all the time goes, in its current form.

We have 2 problems

1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so
bgwriter has to fight with backends to find out where it should clean.
As a result it spends lots of time waiting and insufficient time
cleaning.

2. When a backend can't find a free buffer, it spins for a long time
while holding the lock. This makes the buffer strategy O(N) in its
worst case, which slows everything down. Notably, while this is
happening the bgwriter sits doing nothing at all, right at the moment
when it is most needed. The Clock algorithm is an approximation of an
LRU, so is already suboptimal as a perfect cache. Tweaking to avoid
worst case behaviour makes sense. How much to tweak? Well,...

(1) is fixed by buffreelistlock_reduction.v1.patch

(2) is fixed by freelist_ok.v2.patch

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 91cc001..86af943 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1242,7 +1242,7 @@ BufferSync(int flags)
 	 * Note that we don't read the buffer alloc count here --- that should be
 	 * left untouched till the next BgBufferSync() call.
 	 */
-	buf_id = StrategySyncStart(NULL, NULL);
+	buf_id = StrategySyncStart(NULL, NULL, NULL);
 	num_to_scan = NBuffers;
 	num_written = 0;
 	while (num_to_scan-- > 0)
@@ -1315,6 +1315,7 @@ BgBufferSync(void)
 	int			strategy_buf_id;
 	uint32		strategy_passes;
 	uint32		recent_alloc;
+	bool		with_lock;
 
 	/*
 	 * Information saved between calls so we can determine the strategy
@@ -1352,10 +1353,11 @@ BgBufferSync(void)
 	 * Find out where the freelist clock sweep currently is, and how many
 	 * buffer allocations have happened since our last call.
 	 */
-	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
+	strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc, &with_lock);
 
 	/* Report buffer alloc counts to pgstat */
-	BgWriterStats.m_buf_alloc += recent_alloc;
+	if (with_lock)
+		BgWriterStats.m_buf_alloc += recent_alloc;
 
 	/*
 	 * If we're not running the LRU scan, just stop after doing the stats
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3e62448..27d4ae8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -252,20 +252,57 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
  * being read.
  */
 int
-StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
+StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc, bool *withlock)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	result = StrategyControl->nextVictimBuffer;
-	if (complete_passes)
-		*complete_passes = StrategyControl->completePasses;
-	if (num_buf_alloc)
+	if (withlock)
 	{
-		*num_buf_alloc = StrategyControl->numBufferAllocs;
-		StrategyControl->numBufferAllocs = 0;
+		bool	gotlock = false;
+
+		/*
+		 * Try to get the lock without waiting.
+		 */
+		if (LWLockConditionalAcquire(BufFreelistLock, LW_EXCLUSIVE))
+			gotlock = true;
+
+		/*
+		 * If lock isn't available without waiting then mostly we don't
+		 * care and just read the nextVictimBuffer without a lock.
+		 *
+		 * Bgwriter stats reporting relies on us reading accurate values
+		 * and being able to reset the counter, so about 1% of the time
+		 * we want to wait for the lock so the stats don't get starved out.
+		 */
+		if (!gotlock && random() <= (MAX_RANDOM_VALUE / 100))
+		{
+			LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+			gotlock = true;
+		}
+
+		if (gotlock)
+		{
+			*withlock = true;
+			result = StrategyControl->nextVictimBuffer;
+			if (complete_passes)
+				*complete_passes = StrategyControl->completePasses;
+			if (num_buf_alloc)
+			{
+				*num_buf_alloc = StrategyControl->numBufferAllocs;
+				StrategyControl->numBufferAllocs = 0;
+			}
+			LWLockRelease(BufFreelistLock);
+
+			return result;
+		}
 	}
-	LWLockRelease(BufFreelistLock);
+
+	/*
+	 * We didn't get the lock; report that, then read the value anyway
+	 * on the assumption that reading this value is atomic.
+	 */
+	if (withlock)
+		*withlock = false;
+	result = StrategyControl->nextVictimBuffer;
+
 	return result;
 }

Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-15 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 Anyway, I think every idea thrown out here so far needs about an
 order of magnitude more types of benchmarking test cases before it
 can be evaluated at all.
 
Right.  I'm very excited about all the optimizations going in, and
can't see where the ones committed so far are very likely to cause
problems for workloads other than those tested, but here we're
heading into territory where the performance farm is very
desperately needed.
 
That said, I'll throw out one more idea, as something that worked
very well for me in a similar situation, and played well with
external caching which knew lots about the storage media but nothing
about the database side.  It's very similar to the clock sweep
algorithm, but uses a weighted average rather than a simple count. 
It didn't tend to leave as many buffers clustered at one end or the
other, and the gradation did seem to be useful.
 
I started out this post trying to describe it more generally, but it
all came out sounding too vague and hand-wavy; so I'll give a
description with more particulars which could, of course, be tweaked
without tossing the concept.  This particular description modifies
the techniques I've used to try to better fit the particulars of
PostgreSQL.  It may fall down on contention issues, but since it
benchmarked better for me than the alternatives, I thought it might
at least help spin off other useful ideas.
 
(1)  Each buffer would have a one byte count, incremented on access,
similar to the current count.  Of course, 255 would not wrap on
access, but be left unchanged.
 
(2)  We would have 256 int counters for how many buffers were
available with each access count.  These would be maintained when
the access counts changed.
 
(3)  While sweeping, we wouldn't decrement the access counters for
buffers we couldn't use; rather, there would be a boolean flag to
say whether to divide the access counts for non-chosen buffers by
two.
 
(4)  The reduce flag would be set when any access count goes to
255, and cleared after one complete sweep through the buffers.  (So,
obviously, we note the location when we set the flag.)
 
(5)  To find the next buffer to evict, we would find out what is the
lowest count in use, and sweep forward until we found one.
 
(6)  We would have a background process doing this sweeping and
count reduction which would try to keep a few evicted buffers in a
queue for quick pickup.  This queue would be the only source for
getting a buffer.  Getting a buffer would always be a very
lightweight operation if there's something in the queue.  The
background task would try very hard to keep the queue non-empty,
only coming to rest when the queue was full -- whatever that was
chosen to mean (possibly based on the sort of adaptive calculations
currently used by the background writer).
 
There are a lot of interesting ways this could interact with the
background writer.  One of the more intriguing to me would be for
the sweep process to estimate what current access count levels
will need to be used on its next sweep and queue up any at those
levels which are dirty for the background writer.  This is vaguely
conceptually similar to the wash marker used in some LRU lists.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-15 Thread Jim Nasby
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote:
 It does kind of seem like your numbers indicate we're missing part of
 the picture though. The idea with the clock sweep algorithm is that
 you keep approximately 1/nth of the buffers with each of the n values.
 If we're allowing nearly all the buffers to reach a reference count of
 5 then you're right that we've lost any information about which
 buffers have been referenced most recently.

One possible missing piece here is that OS clock-sweeps depend on the clock 
hand to both increment and decrement the usage count. The hardware sets a bit 
any time a page is accessed; as the clock sweeps, it increases the usage count if 
the bit is set and decreases it if it's clear. I believe someone else in the 
thread suggested this, and I definitely think it's worth an experiment. 
Presumably this would also ease some lock contention issues.

There is another piece that might be relevant... many (most?) OSes keep 
multiple lists of pages. FreeBSD for example contains these page lists 
(http://www.freebsd.org/doc/en/articles/vm-design/article.html). Full 
description follows, but I think the biggest take-away is that there is a 
difference in how pages are handled once they are no longer active based on 
whether the page is dirty or not.

Active: These pages are actively in use and are not currently under 
consideration for eviction. This is roughly equivalent to all of our buffers 
with a usage count of 5.

When an active page's usage count drops to its minimum value, it will get 
unmapped from process space and moved to one of two queues:

Inactive: DIRTY pages that are eligible for eviction once they've been written 
out.

Cache: CLEAN pages that may be immediately reclaimed

Free: A small set of pages that are basically the tail of the Cache list. The 
OS *must* maintain some pages on this list to support memory needed during 
interrupt handling. The size of this list is typically kept very small, and I'm 
not sure if non-interrupt processing will pull from this list.

It's important to note that the OS can pull a page back out of the Inactive and 
Cache lists back into Active very cheaply.

I think there are two interesting points here. First: after a page has been 
determined to no longer be in active use it goes into inactive or cache based 
on whether it's dirty. ISTM that allows for much better scheduling of the 
flushing of dirty pages. That said, I'm not sure how much that would help us 
due to checkpoint requirements.

Second: AFAIK only the Active list has a clock sweep. I believe the others are 
LRU (the mentioned URL refers to them as queues). I believe this works well 
because if a page faults it just needs to be removed from whichever queue it is 
in, added to the Active queue, and mapped back into process space.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Simon Riggs
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

 Right, but the same doesn't apply to what I'm talking about.  You just
 put each guy into the linked list when his reference count goes down
 to 0.  When you want to allocate a buffer, you pop somebody off.  If
 his reference count is still 0, you forget about him and pop the next
 guy, and keep going until you find a suitable victim.

 The problem with just running the clock sweep every time you need a
 victim buffer is that you may have to pass over a large number of
 unevictable buffers before you find someone you can actually kick out.
  That's work that should really be getting done in the background, not
 at the absolute last moment when we discover, hey, we need a buffer.

I think Greg is right here. Those suggested changes just bring back the LRU.

If you keep a separate list then you need to serialize access to it.

usage_count == 0 and unevictable are dynamic states, so if the
bgwriter sees those states they can change before anyone uses the
buffer.

The only state which is unchangeable is a recently filled block, such
as from a COPY, which is why I originally suggested a
filled-block-list - though I don't think we need it now because of the
buffer cycle strategy.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Simon Riggs
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:

 I agree that
 something's missing.

I'm quoting you completely out of context here, but yes, something is missing.

We can't credibly do one test on usage count in shared buffers and
then start talking about how buffer management is all wrong.

My own patch requires more test evidence before we can commit it,
which is why I hadn't published it before now. I'll endeavour to do
that now its on the table.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Robert Haas
On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

 Right, but the same doesn't apply to what I'm talking about.  You just
 put each guy into the linked list when his reference count goes down
 to 0.  When you want to allocate a buffer, you pop somebody off.  If
 his reference count is still 0, you forget about him and pop the next
 guy, and keep going until you find a suitable victim.

 The problem with just running the clock sweep every time you need a
 victim buffer is that you may have to pass over a large number of
 unevictable buffers before you find someone you can actually kick out.
  That's work that should really be getting done in the background, not
 at the absolute last moment when we discover, hey, we need a buffer.

 I think Greg is right here. Those suggested changes just bring back the LRU.

Since the problem with LRU is that it requires moving each buffer to
the front of the list every time it's touched, and since the idea that
I proposed does not require that, I don't know what you mean by this.

 If you keep a separate list then you need to serialize access to it.

That is true.  However, there are some reasons to think it would be
better than what we're doing right now.  Right now, we acquire the
BufFreelistLock, and then take and release some number of spinlocks.
It seems fairly easy to construct cases where that number is routinely
over 100; even on my laptop, I could construct cases where it was
routinely in the 50-100 range.  A linked list that we can just dequeue things
from could be protected by a single spinlock, which would hopefully be
somewhat faster.  (In the interest of credit where credit is due, this
is pretty much the point Jim Nasby has been making every time this
argument comes up, and I've been dismissing it, but now that I
understand how much work gets done in that clock sweep code, I'm
starting to think he might be right.)

However, if it turns out that that there's still too much contention
over that spinlock, we can probably fix that by maintaining N lists
instead of one, distributing buffers between them in round-robin
fashion, and having each backend choose a list based on its backend ID
modulo N.  The testing I've done so far seems to indicate that
spinlock contention doesn't really become noticeable until you have
32-40 CPUs pounding hard on the same spinlock, so even N = 4 is
probably enough to make the problem go away.  But I don't even think
we're going to have to go that far right at the outset, because
there's another major source of contention in the buffer eviction
path: the buffer mapping locks.

 usage_count == 0 and unevictable are dynamic states, so if the
 bgwriter sees those states they can change before anyone uses the
 buffer.

Yep.  That's already a problem.  When we pin the buffer to be evicted,
we're not yet holding the relevant buffer mapping locks.  By the time
we do, someone else can have pinned the page.  So there's already
retry logic here.  In theory, you can imagine a backend getting
starved for an arbitrarily long period of time, because it keeps
picking a victim buffer that someone else wrenches away before we
actually manage to kick it out.  I haven't yet seen any evidence that
that's a problem in practice, but if it is, this idea will make it
worse.  It seems hard to know without testing it, though.

The big problem with this idea is that it pretty much requires that
the work you mentioned in one of your other emails - separating the
background writer and checkpoint machinery into two separate processes
- to happen first.  So I'm probably going to have to code that up to
see whether this works.  If you're planning to post that patch soon
I'll wait for it.  Otherwise, I'm going to have to cobble together
something that is at least good enough for testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Martijn van Oosterhout
On Sat, Aug 13, 2011 at 09:40:15PM +0100, Greg Stark wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 
  and possibly we ought to put them all in a
  linked list so that the next guy who needs a buffer can just pop one
 
 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

Well, there are such things as lock-free linked lists. Whether they'd
help here is the question though.

http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Simon Riggs
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:

 The big problem with this idea is that it pretty much requires that
 the work you mentioned in one of your other emails - separating the
 background writer and checkpoint machinery into two separate processes
 - to happen first.  So I'm probably going to have to code that up to
 see whether this works.  If you're planning to post that patch soon
 I'll wait for it.  Otherwise, I'm going to have to cobble together
 something that is at least good enough for testing.

No, the big problem with the idea is that regrettably it is just an
idea on your part and has no basis in external theory or measurement.
I would not object to you investigating such a path and I think you
are someone that could invent something new and original there, but it
seems much less likely to be fruitful, or at least would require
significant experimental results to demonstrate an improvement in a
wide range of use cases to the rest of the hackers.

As to you not being able to work on your idea until I've split
bgwriter/checkpoint, that's completely unnecessary, and you know it. A
single ifdef is sufficient there, if at all.

The path I was working on (as shown in the earlier patch) was to apply
some corrections to the existing algorithm to reduce its worst case
behaviour. That's something I've seen mention of people doing for
RedHat kernels.

Overall, I think a minor modification is the appropriate path. If
Linux or other OS already use ClockPro then we already benefit from
it. It seems silly to track blocks that recently left shared buffers
when they are very likely still actually in memory in the filesystem
cache.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Simon Riggs
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

 Right, but the same doesn't apply to what I'm talking about.  You just
 put each guy into the linked list when his reference count goes down
 to 0.  When you want to allocate a buffer, you pop somebody off.  If
 his reference count is still 0, you forget about him and pop the next
 guy, and keep going until you find a suitable victim.

 The problem with just running the clock sweep every time you need a
 victim buffer is that you may have to pass over a large number of
 unevictable buffers before you find someone you can actually kick out.
  That's work that should really be getting done in the background, not
 at the absolute last moment when we discover, hey, we need a buffer.

 I think Greg is right here. Those suggested changes just bring back the LRU.

 Since the problem with LRU is that it requires moving each buffer to
 the front of the list every time it's touched, and since the idea that
 I proposed does not require that, I don't know what you mean by this.

 If you keep a separate list then you need to serialize access to it.

 That is true.  However, there are some reasons to think it would be
 better than what we're doing right now.  Right now, we acquire the
 BufFreelistLock, and then take and release some number of spinlocks.
 It seems fairly easy to construct cases where that number is routinely
 over 100; even on my laptop, I could construct cases where it was
 routinely 50-100 range.  A linked list that we can just dequeue things
 from could be protected by a single spinlock, which would hopefully be
 somewhat faster.  (In the interest of credit where credit is due, this
 is pretty much the point Jim Nasby has been making every time this
 argument comes up, and I've been dismissing it, but now that I
 understand how much work gets done in that clock sweep code, I'm
 starting to think he might be right.)

 However, if it turns out that that there's still too much contention
 over that spinlock, we can probably fix that by maintaining N lists
 instead of one, distributing buffers between them in round-robin
 fashion, and having each backend choose a list based on its backend ID
 modulo N.  The testing I've done so far seems to indicate that
 spinlock contention doesn't really become noticeable until you have
 32-40 CPUs pounding hard on the same spinlock, so even N = 4 is
 probably enough to make the problem go away.  But I don't even think
 we're going to have to go that far right at the outset, because
 there's another major source of contention in the buffer eviction
 path: the buffer mapping locks.

All of the above presumes that we have a list of best-next-victims. We
don't have that list, so describing how we would optimise access to it
means nothing.

Yes, if we had it, we could dequeue them quickly - that is not the problem.

The problem is that creating the best-next-victim list AND making it
accurate is the act that causes contention.

Yes, we can maintain an inaccurate list cheaply.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Robert Haas
On Sun, Aug 14, 2011 at 10:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas robertmh...@gmail.com wrote:
 The big problem with this idea is that it pretty much requires that
 the work you mentioned in one of your other emails - separating the
 background writer and checkpoint machinery into two separate processes
 - to happen first.  So I'm probably going to have to code that up to
 see whether this works.  If you're planning to post that patch soon
 I'll wait for it.  Otherwise, I'm going to have to cobble together
 something that is at least good enough for testing.

 No, the big problem with the idea is that regrettably it is just an
 idea on your part and has no basis in external theory or measurement.
 I would not object to you investigating such a path and I think you
 are someone that could invent something new and original there, but it
 seems much less likely to be fruitful, or at least would require
 significant experimental results to demonstrate an improvement in a
 wide range of use cases to the rest of the hackers.

All right, well, I'll mull over whether it's worth pursuing.  Unless I
or someone else comes up with an idea I like better, I think it
probably is.

 As to you not being able to work on your idea until I've split
 bgwriter/checkpoint, that's completely unnecessary, and you know it. A
 single ifdef is sufficient there, if at all.

Hmm.  Well, it might be unnecessary, but if I knew it were
unnecessary, I wouldn't have said that I thought it was necessary.

 The path I was working on (as shown in the earlier patch) was to apply
 some corrections to the existing algorithm to reduce its worst case
 behaviour. That's something I've seen mention of people doing for
 RedHat kernels.

Yeah.  Your idea is appealing because it bounds the amount of time spent
scanning for a victim.  There is some chance that you might kick out a
really hot buffer if there is a long series of such buffers in a row.
Without testing, I don't know whether that's a serious problem or not.

 Overall, I think a minor modification is the appropriate path. If
 Linux or other OS already use ClockPro then we already benefit from
 it. It seems silly to track blocks that recently left shared buffers
 when they are very likely still actually in memory in the filesystem
 cache.

You may be right.  Basically, my concern is that buffer eviction is
too slow.  On a 32-core system, it's easy to construct a workload
where the whole system bottlenecks on the rate at which buffers can be
evicted and replaced - not because the system is fundamentally
incapable of copying data around that quickly, but because everything
piles up behind BufFreelistLock, and to a lesser extent the buffer
mapping locks.  Your idea may help with that, but I doubt that it's a
complete solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree that something's missing.

 I'm quoting you completely out of context here, but yes, something is missing.

 We can't credibly do one test on usage count in shared buffers and
 then start talking about how buffer management is all wrong.

More generally: the originally presented facts suggest that there might
be value in improving the buffer access strategy code that keeps
particular operations from using all of shared_buffers.  It seems to me
to be a giant and unsubstantiated leap from that to the conclusion that
there's anything wrong with the clock sweep algorithm.  Moreover,
several of the proposed fixes amount to reversion to methods that
we already know are less good than the clock sweep, because we already
tried them years ago.  So I've been quite unimpressed with the quality
of discussion in this thread.

regards, tom lane



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Robert Haas
On Sun, Aug 14, 2011 at 1:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas robertmh...@gmail.com wrote:
 I agree that something's missing.

 I'm quoting you completely out of context here, but yes, something is 
 missing.

 We can't credibly do one test on usage count in shared buffers and
 then start talking about how buffer management is all wrong.

 More generally: the originally presented facts suggest that there might
 be value in improving the buffer access strategy code that keeps
 particular operations from using all of shared_buffers.

Possibly.  Since I've realized that we only switch to a ring buffer
when the size of the relation is more than 25% of shared buffers, I'm
less concerned about that problem.  My test case only demonstrated a
~20% performance improvement from getting rid of the ring buffer, and
that was with a relation that happened to be 27% of the size of
shared_buffers, so it was a bit unlucky.  I think it'd be interesting
to try to make this smarter in some way, but it's not bugging me as
much now that I've realized that I was unlucky to fall down that
particular well.

 It seems to me
 to be a giant and unsubstantiated leap from that to the conclusion that
 there's anything wrong with the clock sweep algorithm.  Moreover,
 several of the proposed fixes amount to reversion to methods that
 we already know are less good than the clock sweep, because we already
 tried them years ago.  So I've been quite unimpressed with the quality
 of discussion in this thread.

Well, here's the problem I'm worried about: if 99% of shared_buffers
is filled with a very hot working set, every new page that gets
brought in will need to scan, on average, 100 buffers before finding
something to evict.  That seems slow.  Simon is proposing to bound the
really bad case where you flip through the entire ring multiple times
before you find a buffer, and that may well be worth doing.  But I
think even scanning 100 buffers every time you need to bring something
in is too slow.  What's indisputable is that a SELECT-only workload
which is larger than shared_buffers can be very easily rate-limited by
the speed at which BufFreelistLock can be taken and released.  If you
have a better idea for solving that problem, I'm all ears...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-14 Thread Greg Smith

On 08/12/2011 10:51 PM, Greg Stark wrote:

If you execute a large batch delete or update or even just set lots of
hint bits you'll dirty a lot of buffers. The ring buffer forces the
query that is actually dirtying all these buffers to also do the i/o
to write them out. Otherwise you leave them behind to slow down other
queries. This was one of the problems with the old vacuum code which
the ring buffer replaced. It left behind lots of dirtied buffers for
other queries to do i/o on.
   


I ran into the other side of this when trying to use Linux's relatively 
new dirty_background_bytes tunable to constrain the OS write cache.  
When running with the current VACUUM ring buffer code, if there isn't 
also a large OS write cache backing that, performance suffers quite a 
bit.  I've been adding test rigging to quantify this into pgbench-tools 
recently, and I fear that one of the possible outcomes could be pushback 
pressure toward making the database's ring buffer bigger.  Just a 
theory--waiting on some numbers still.


Anyway, I think every idea thrown out here so far needs about an order 
of magnitude more types of benchmarking test cases before it can be 
evaluated at all.  The case I just mentioned is a good example of why.  
Every other test I ran showed a nice improvement with the kernel tuning 
I tried.  But VACUUM was massively detuned in the process.


I have an entire file folder filled with notes on way to reorganize the 
buffer cache, from my background writer work for 8.3.  In my mind 
they're all sitting stuck behind the problem of getting enough 
standardized benchmark workloads to have a performance regression 
suite.  The idea of touching any of this code without a look at a large 
number of different tests is a bit optimistic.  What I expect to happen 
here is that all initially proposed changes will end up tuning for one 
workload at the expense of others that aren't being measured.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Robert Haas
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:
 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

 I don't disagree with the general thrust of your point, but I just
 wanted to point out one case where not using free buffers even though
 they're available might make sense:

 If you execute a large batch delete or update or even just set lots of
 hint bits you'll dirty a lot of buffers. The ring buffer forces the
 query that is actually dirtying all these buffers to also do the i/o
 to write them out. Otherwise you leave them behind to slow down other
 queries. This was one of the problems with the old vacuum code which
 the ring buffer replaced. It left behind lots of dirtied buffers for
 other queries to do i/o on.

Interesting point.

After thinking about this some more, I'm coming around to the idea
that we need to distinguish between:

1. Ensuring a sufficient supply of evictable buffers, and
2. Evicting a buffer.

The second obviously needs to be done only when needed, but the first
one should really be done as background work.  Currently, the clock
sweep serves both functions, and that's not good.  We shouldn't ever
let ourselves get to the point where there are no buffers at all with
reference count zero, so that the next guy who needs a buffer has to
spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be
done as a background task; and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
off.
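
The scheme above can be sketched as follows (Python used purely for illustration; the real buffer manager is C, and all names here are hypothetical). A background task gathers buffers whose usage count has reached zero into a list, and the foreground allocator pops entries, discarding any whose count rose again after they were listed:

```python
from collections import deque

class Buffer:
    def __init__(self, buf_id):
        self.buf_id = buf_id
        self.usage_count = 0

def background_refill(buffers, freelist):
    # Background task: remember every buffer whose usage count is zero.
    for buf in buffers:
        if buf.usage_count == 0 and buf not in freelist:
            freelist.append(buf)

def pop_victim(freelist):
    # Foreground allocator: pop candidates, skipping any that became
    # hot again after they were added to the list.
    while freelist:
        buf = freelist.popleft()
        if buf.usage_count == 0:
            return buf
    return None  # list exhausted: fall back to a full clock sweep
```

The revalidation on pop is what makes the list safe to maintain lazily: a stale entry costs one extra pop, not a wrong eviction.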

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Greg Stark
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:

 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.

It does kind of seem like your numbers indicate we're missing part of
the picture though. The idea with the clock sweep algorithm is that
you keep approximately 1/nth of the buffers with each of the n values.
If we're allowing nearly all the buffers to reach a reference count of
5 then you're right that we've lost any information about which
buffers have been referenced most recently.

I wonder if we're supposed to be doing something like advancing the
clock hand each time we increment a reference count as well?
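
For reference, the clock sweep being discussed can be sketched like this (Python for illustration only; the real code is StrategyGetBuffer() in C). It also demonstrates the worst case raised elsewhere in the thread: if every buffer sits at the maximum usage count of 5, the hand must make five full circuits before finding a victim.

```python
def clock_sweep(usage_counts, hand):
    """Find a victim: decrement each count the hand passes; evict at zero.
    Returns (victim_index, new_hand_position, buffers_inspected)."""
    n = len(usage_counts)
    inspected = 0
    while True:
        inspected += 1
        i = hand
        hand = (hand + 1) % n
        if usage_counts[i] == 0:
            return i, hand, inspected
        usage_counts[i] -= 1
```

With four buffers all at usage count 5, the sweep inspects 21 buffers (5 * N passes, plus one) before it can evict anything.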


-- 
greg



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-13 Thread Robert Haas
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas robertmh...@gmail.com wrote:
 and possibly we ought to put them all in a
 linked list so that the next guy who needs a buffer can just pop one

 The whole point of the clock sweep algorithm is to approximate an LRU
 without needing to maintain a linked list. The problem with a linked
 list is that you need to serialize access to it so every time you
 reference a buffer you need to wait on a lock for the list so you can
 move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about.  You just
put each guy into the linked list when his reference count goes down
to 0.  When you want to allocate a buffer, you pop somebody off.  If
his reference count is still 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim.

The problem with just running the clock sweep every time you need a
victim buffer is that you may have to pass over a large number of
unevictable buffers before you find someone you can actually kick out.
 That's work that should really be getting done in the background, not
at the absolute last moment when we discover, hey, we need a buffer.

 It does kind of seem like your numbers indicate we're missing part of
 the picture though. The idea with the clock sweep algorithm is that
 you keep approximately 1/nth of the buffers with each of the n values.
 If we're allowing nearly all the buffers to reach a reference count of
 5 then you're right that we've lost any information about which
 buffers have been referenced most recently.

It doesn't seem right to have 1/nth of the buffers at each of n values
because that might not match the actual workload.  For example, you
might have lightning-hot buffers that fill 50% of your available
cache.  Trying to artificially force some of those down to usage count
4 or 3 doesn't actually get you anywhere.  In fact, AFAICS, there's no
direct reason to care about how many buffers have usage count 1
vs. usage count 5.  What IS important for performance is that there
are enough with usage count 0.  Any other distinction is only useful
to the extent that it allows us to make a better decision about which
buffers we should push down to 0 next.

 I wonder if we're supposed to be doing something like advancing the
 clock hand each time we increment a reference count as well?

That precise thing seems far too expensive, but I agree that
something's missing.  Right now we decrease usage counts when the
clock hand moves (which is driven by buffer allocation), and we
increase them when buffers are pinned (which is driven by access to
already-resident buffers).  I feel like that's comparing apples and
oranges.  If some subset of shared_buffers is very hot relative to the
uncached pages, then everything's going to converge to 5 (or whatever
arbitrary maximum we care to set in lieu of 5).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 On
 the other hand, the buffer manager has *no problem at all* trashing
 the buffer arena if we're faulting in pages for an index scan rather
 than a sequential scan.  If you manage to get all of sample_data into
 memory (by running many copies of the above query in parallel, you can
 get each one to allocate its own ring buffer, and eventually pull in
 all the pages), and then run some query that probes an index which is
 too large to fit in shared_buffers, it cheerfully blows the whole
 sample_data table out without a second thought.  Had you sequentially
 scanned a big table, of course, there would be some protection, but an
 index scan can stomp all over everything with complete impunity.

That's a good observation, and I think we should do this:

* Make an IndexScan use a ring buffer once it has used 32 blocks. The
vast majority won't do that, so we avoid overhead on the common path.

* Make a BitmapIndexScan use a ring buffer when we know that the
index is larger than 32 blocks. (Ignore the upper parts of the tree
for that calculation.)
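
The proposed rule is simple enough to state in a few lines (Python for illustration; names are hypothetical, not anything in the tree): the scan starts against the shared arena and switches to a small ring only once it has shown it will touch more than a handful of blocks.

```python
RING_THRESHOLD = 32  # blocks, per the proposal above

def index_scan_buffer_strategy(blocks_touched):
    # Most index scans never reach the threshold, so the common
    # path pays no ring-buffer overhead at all.
    return "ring" if blocks_touched > RING_THRESHOLD else "shared"
```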


-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
 ------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
 (6 rows)

 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.

Also, your logic is a little off, since you're doing the scan on an
otherwise quiet machine soon after startup and there are some
completely unused buffers. That isn't the typical state of the buffer
cache and so spoiling the cache is not acceptable in the general case.
The above case doesn't suck because it carefully reserved parts of
shared buffers for other users when spoiling the cache would be of no
benefit to the user, so it worked great from my perspective.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 The general problem here is that we are not very smart about handling
 workloads with weak locality - i.e. the working set is larger than
 shared buffers.  If the working set fits in shared_buffers, we will
 keep it there, and it will be strongly protected against eviction.
 But if the working set *doesn't* fit in shared buffers, then we fall
 back on some not-very-smart heuristics to decide what to do: keep
 pages involved in index scans, ditch those involved in sequential
 scans.

 It seems to me that we need a smarter algorithm.  It seems that NetBSD
 has adopted the use of Clock-Pro, and there are some discussions of it
 being used in Linux as well, though I'm not sure whether that's
 actually (still?) the case.  Paper is here, although I find the
 description of the algorithm to be approximately clear as mud:

 http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

 One of the important notions which many of the algorithms in the
 literature seem to hit on in one way or another is that you mustn't go
 and decide that all of your buffers - or nearly all your buffers - are
 hot.  That's really equivalent to saying that none of your buffers are
 hot; and it's also pretty easy to see that such a scenario would be
 disastrous for our current algorithm, which - if everything in
 shared_buffers has usage-count 5 all the time - will have to clock
 sweep a long way each time it needs to allocate a buffer.  In fact,
 the more of your buffers want to be hot (and thus hard to evict), the
 fewer of them should actually be allowed to acquire that status.

This is something I've been actively working on. I was on the trail of
possible reasons that increasing the size of shared buffers would slow
down PostgreSQL.

The worst case behaviour of the current freelist code is that it can
take up to 5 * shared_buffers checks before identifying a victim
buffer. That occurs when we have a working set exactly matching size
of shared buffers. There are problems when large portions of shared
buffers are very popular, so that the free-ish buffers are a small
proportion of the total buffers - this case gets worse if the
distribution of buffer allocations is non-random so you get say 10
free-ish blocks together, followed by N-10 very popular ones. That
makes every 11th freelist request take ~O(shared_buffers). These
problems will show themselves as sporadic BufFreelistLock contention.
Sporadic is the problem here since it may make the contention hard to
measure in aggregate, yet with real impact on performance.

For us to benefit from increased shared_buffers we need to have an
algorithm that is O(k) in its worst case behaviour and O(1) in its
best case behaviour - the latter aspect is provided by a correctly
functioning bgwriter. At the moment, we do nothing to enforce O(k)
behaviour.

The following patch implements O(k) behaviour using a heuristic limit.
My first attempt was a fixed heuristic limit, but that was less
suitable because there are actually 2 requirements. The value of k
must be small to have a measurable impact on the smoothness of
StrategyGetBuffer(), but k must also be fairly large to ensure the
choice of victim is a relatively good one. So the way to balance those
conflicting objectives is to have a progressively increasing search
window. And just to repeat, k is not a % of shared buffers, which would
still give O(N) behaviour.
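
The patch itself is attached as binary data below, so here is one plausible reading of the idea in Python pseudocode — an illustration of the O(k) bound, not the actual patch: cap each victim search at a window of buffers, remember the least-used buffer seen so far, and widen the window progressively instead of scanning all of shared_buffers.

```python
def get_victim_capped(usage_counts, hand, initial_window=8):
    """Victim search bounded to O(k): widen the window progressively and
    fall back to the least-used buffer seen if no count-zero buffer appears."""
    n = len(usage_counts)
    window = initial_window
    best_i, best_count = None, None
    scanned = 0
    while True:
        for _ in range(window):
            i = hand
            hand = (hand + 1) % n
            scanned += 1
            if usage_counts[i] == 0:
                return i, hand
            if best_count is None or usage_counts[i] < best_count:
                best_i, best_count = i, usage_counts[i]
            usage_counts[i] -= 1
        if scanned >= n:
            # Whole pool seen once: accept a "relatively good" victim.
            return best_i, hand
        window *= 2
```

The small initial window keeps StrategyGetBuffer() smooth in the common case, while the doubling lets an unlucky search still find a relatively good victim without degenerating to O(shared_buffers) on every call.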

I've not been reading the literature, given the problems we had in
2004/5 regarding patents in this area. I also think that since we rely
on the underlying filesystem for caching that we don't have exactly
the same problem as other systems.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


freelist_ok.v2.patch
Description: Binary data



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs si...@2ndquadrant.com wrote:

 I've not been reading the literature, given the problems we had in
 2004/5 regarding patents in this area. I also think that since we rely
 on the underlying filesystem for caching that we don't have exactly
 the same problem as other systems.

Reading the link you gave, I'm unimpressed by ClockPro. The
measurements made are all in low numbers of MB and the results show
that Clock is roughly same or better for large caches, but one might
read them multiple ways.

The zone of interest for us is above 1GB and the issues we care about
are contention, so even if we think ClockPro has a chance of improving
things we really don't have any measurements that support the cases we
care about.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

I hadn't thought of that, but I think that's only true if the relation
won't fit in shared_buffers (or whatever portion of shared_buffers is
reasonably available, given the other activity on the machine).  In
this particular case, it's almost 20% faster if the relation is all in
shared_buffers; I tested it.  I think what's going on here is that
initscan() has a heuristic that tries to use a BufferAccessStrategy if
the relation is larger than 1/4 of shared_buffers.  That's probably a
pretty good heuristic in general, but in this case I have a relation
which just so happens to be 27.9% of shared_buffers but will still
fit.  As you say below, that may not be typical in real life, although
there are probably data warehousing systems where it's normal to have
only one big query running at a time.
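
That heuristic is easy to state (sketched in Python with hypothetical names; the real test lives in initscan() in C), and it shows why the 27.9% relation above trips the threshold even though it would fit:

```python
def scan_uses_ring_buffer(rel_nblocks, shared_buffers_nblocks):
    # initscan()-style rule: big relations go through a small ring
    # buffer instead of competing for the whole shared arena.
    return rel_nblocks > shared_buffers_nblocks // 4
```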

 Also, your logic is a little off, since you're doing the scan on an
 otherwise quiet machine soon after startup and there are some
 completely unused buffers. That isn't the typical state of the buffer
 cache and so spoiling the cache is not acceptable in the general case.
 The above case doesn't suck because it carefully reserved parts of
 shared buffers for other users when spoiling the cache would be of no
 benefit to the user, so it worked great from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:

 On
 the other hand, the buffer manager has *no problem at all* trashing
 the buffer arena if we're faulting in pages for an index scan rather
 than a sequential scan.  If you manage to get all of sample_data into
 memory (by running many copies of the above query in parallel, you can
 get each one to allocate its own ring buffer, and eventually pull in
 all the pages), and then run some query that probes an index which is
 too large to fit in shared_buffers, it cheerfully blows the whole
 sample_data table out without a second thought.  Had you sequentially
 scanned a big table, of course, there would be some protection, but an
 index scan can stomp all over everything with complete impunity.

 That's a good observation and I think we should do this

 * Make an IndexScan use a ring buffer once it has used 32 blocks. The
 vast majority won't do that, so we avoid overhead on the common path.

 * Make a BitmapIndexScan use a ring buffer when we know that the
 index is larger than 32 blocks. (Ignore upper parts of tree for that
 calc).

We'd need to think about what happens to the internal pages of the
btree, the leaf pages, and then the heap pages from the underlying
relation; those probably shouldn't all be treated the same.  Also, I
think the tricky part is figuring out when to apply the optimization
in the first place.  Once we decide we need a ring buffer, a very
small one (like 32 blocks) is probably the way to go.  But it will be
a loser to apply the optimization to data sets that would otherwise
have fit in shared_buffers.  This is a classic case of the LRU/MRU
problem.  You want to evict buffers in LRU fashion until the working
set gets larger than you can cache; and then you want to switch to MRU
to avoid uselessly caching pages that you'll never manage to revisit
before they're evicted.

The point of algorithms like Clock-Pro is to try to have the system
work that out on the fly, based on the actual workload, rather than
using heuristics.  I agree with you there's no convincing evidence
that Clock-Pro would be better for us; I mostly thought it was
interesting because it seems that the NetBSD and Linux guys find it
interesting, and they're managing much larger caches than the ones
we're dealing with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

 I hadn't thought of that, but I think that's only true if the relation
 won't fit in shared_buffers (or whatever portion of shared_buffers is
 reasonably available, given the other activity on the machine).  In
 this particular case, it's almost 20% faster if the relation is all in
 shared_buffers; I tested it.  I think what's going on here is that
 initscan() has a heuristic that tries to use a BufferAccessStrategy if
 the relation is larger than 1/4 of shared_buffers.  That's probably a
 pretty good heuristic in general, but in this case I have a relation
 which just so happens to be 27.9% of shared_buffers but will still
 fit.  As you say below, that may not be typical in real life, although
 there are probably data warehousing systems where it's normal to have
 only one big query running at a time.

I think there are reasonable arguments to make

* prefer_cache = off (default) | on, a table-level storage parameter;
=on will disable the use of BufferAccessStrategy

* make cache_spoil_threshold a parameter, with default 0.25

Considering the world of very large RAMs in which we now live, some
control of the above makes sense.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The worst case behaviour of the current freelist code is that it can
 take up to 5 * shared_buffers checks before identifying a victim
 buffer. That occurs when we have a working set exactly matching size
 of shared buffers. There are problems when large portions of shared
 buffers are very popular, so that the free-ish buffers are a small
 proportion of the total buffers - this case gets worse if the
 distribution of buffer allocations is non-random so you get say 10
 free-ish blocks together, followed by N-10 very popular ones. That
 makes every 11th freelist request take ~O(shared_buffers). These
 problems will show themselves as sporadic BufFreelistLock contention.
 Sporadic is the problem here since it may make the contention hard to
 measure in aggregate, yet with real impact on performance.

I've been thinking about this exact same set of problems.

 For us to benefit from increased shared_buffers we need to have an
 algorithm that is O(k) in its worst case behaviour and O(1) in its
 best case behaviour - the latter aspect is provided by a correctly
 functioning bgwriter. At the moment, we do nothing to enforce O(k)
 behaviour.

Check.

 The following patch implements O(k) behaviour using a heuristic limit.
 My first attempt was a fixed heuristic limit, but that was less
 suitable because there are actually 2 requirements. The value of k
 must be small to have a measurable impact on the smoothness of
 StrategyGetBuffer(), but k must also be fairly large to ensure the
 choice of victim is a relatively good one. So the way to balance those
 conflicting objectives is to have a progressively increasing search
 window. And just to repeat, k is not a % of shared buffers, which would
 still give O(N) behaviour.

This is a very interesting idea, and I think it has a lot of potential.

One additional thought I had is that right now I think it's far too
easy for buffers to get pushed all the way up to usage count 5,
because we bump the reference count every time the buffer is pinned,
which can easily happen many times per clock sweep.  I think we would
be better off having a used flag on each buffer.  Each pin sets the
used bit, but does nothing if it is already set.  The usage count is
only changed by the clock sweep, which does the following on each
buffer:

- If the used flag is set, then the flag is cleared and the usage
count increases by one to a maximum of 5.
- If the used flag is not set, then the usage count decreases by one.

That way, buffers can't get hot because of a fast flurry of access
followed by nothing - to get up to usage_count 5, they've actually got
to stay interesting for a period of time.  As a side effect, I believe
we could get rid of the spinlock that we currently take and release
for every buffer the clock sweep hits, because we could regard the
usage_count as protected by the BufFreelistLock rather than by the
buffer header spinlock; and the used flag doesn't require perfect
synchronization.  We'd only pin the buffer we actually plan to evict
(and could recheck the used flag before doing so, in case a memory
ordering effect made us miss the fact that it had been recently set).
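
The scheme can be sketched as follows (illustrative Python; the real structures would be C fields in the buffer header). Pinning only sets a flag, and the sweep alone moves the usage count, so a burst of pins between two sweep passes raises the count by at most one:

```python
MAX_USAGE = 5

class Buf:
    def __init__(self):
        self.used = False   # set on pin, cleared by the sweep
        self.usage = 0      # changed only by the sweep

def pin(buf):
    buf.used = True         # no counter bump, no matter how many pins

def sweep_step(buf):
    """Advance the clock over one buffer; True means eviction candidate."""
    if buf.used:
        buf.used = False
        buf.usage = min(buf.usage + 1, MAX_USAGE)
        return False
    if buf.usage == 0:
        return True
    buf.usage -= 1
    return False
```

A buffer has to keep getting pinned across several clock circuits to reach the maximum, which is exactly the "stay interesting for a period of time" property described above.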

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Simon Riggs
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote:

  But it will be
 a loser to apply the optimization to data sets that would otherwise
 have fit in shared_buffers.

Spoiling the cache is a bad plan, even if it makes the current query faster.

I think we should make the optimisation stronger still and use
POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the
OS cache is a problem as well.
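
The call in question is posix_fadvise() with the POSIX_FADV_DONTNEED advice. A toy demonstration of the mechanism (Python's os module exposes the same POSIX call on Linux; PostgreSQL would issue it from C as pages leave the ring — the function name here is made up):

```python
import os
import tempfile

def drop_range_from_os_cache(path, offset, length):
    """Advise the kernel that a byte range of a file won't be reused soon.
    The kernel is free to ignore the hint; this only *allows* early reclaim."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 8192)
        name = f.name
    drop_range_from_os_cache(name, 0, 8192)  # no error on a POSIX system
    os.unlink(name)
```

Because it is only advisory, issuing it for blocks that would in fact have been reused is cheap but not free — which is the tension the next message raises.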

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 You're missing an important point. The SeqScan is measurably faster
 when using the ring buffer because of the effects of L2 caching on
 the buffers.

 I hadn't thought of that, but I think that's only true if the relation
 won't fit in shared_buffers (or whatever portion of shared_buffers is
 reasonably available, given the other activity on the machine).  In
 this particular case, it's almost 20% faster if the relation is all in
 shared_buffers; I tested it.  I think what's going on here is that
 initscan() has a heuristic that tries to use a BufferAccessStrategy if
 the relation is larger than 1/4 of shared_buffers.  That's probably a
 pretty good heuristic in general, but in this case I have a relation
 which just so happens to be 27.9% of shared_buffers but will still
 fit.  As you say below, that may not be typical in real life, although
 there are probably data warehousing systems where it's normal to have
 only one big query running at a time.

 I think there are reasonable arguments to make

 * prefer_cache = off (default) | on a table level storage parameter,
 =on will disable the use of BufferAccessStrategy

 * make cache_spoil_threshold a parameter, with default 0.25

Yeah, something like that might make sense.  Of course, a completely
self-tuning system would be better, but I'm not sure we're going to
find one of those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Robert Haas
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote:
  But it will be
 a loser to apply the optimization to data sets that would otherwise
 have fit in shared_buffers.

 Spoiling the cache is a bad plan, even if it makes the current query faster.

 I think we should make the optimisation stronger still and use
 POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the
 OS cache is a problem as well.

That would probably be better for really big tables, but it will be
worse for those where the entire table fits in the OS cache.

Cache spoiling means you're putting things into the cache which
won't still be there the next time they're needed.  If the data in
question fits in cache (and I don't think we can regard that as
uncommon, particularly for the OS cache, which can be half a terabyte)
then you don't want the anti-spoiling logic to kick in.

One thing we could consider for large sequential scans is doing
POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more
time if the scan wraps).  That's basically throwing the problem of
whether or not to go LRU or MRU back on the OS, but the OS may well
have a better idea whether there's enough cache available to hold the
whole than we do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread daveg
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote:
 I think there are reasonable arguments to make
 
 * prefer_cache = off (default) | on a table level storage parameter,
 =on will disable the use of BufferAccessStrategy
 
 * make cache_spoil_threshold a parameter, with default 0.25
 
 Considering the world of very large RAMs in which we now live, some
 control of the above makes sense.

As long as we are discussing cache settings for tables, I have a client
who would like to be able to lock specific tables and indexes into cache
as they have strict response time requirements for particular queries.
At the moment they are running postgres with a tablespace on ramfs and
taking frequent backups, but this is not optimal.

-dg

-- 
David Gould   da...@sonic.net   510 536 1443   510 282 0869
If simplicity worked, the world would be overrun with insects.



Re: [HACKERS] our buffer replacement strategy is kind of lame

2011-08-12 Thread Greg Stark
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas robertmh...@gmail.com wrote:
 Only 96 of the 14286 buffers in sample_data are in shared_buffers,
 despite the fact that we have 37,218 *completely unused* buffers lying
 around.  That sucks, because it means that the sample query did a
 whole lot of buffer eviction that was completely needless.  The ring
 buffer is there to prevent trashing the shared buffer arena, but here
 it would have been fine to trash the arena: there wasn't anything
 there we would have minded losing (or, indeed, anything at all).

I don't disagree with the general thrust of your point, but I just
wanted to point out one case where not using free buffers even though
they're available might make sense:

If you execute a large batch delete or update or even just set lots of
hint bits you'll dirty a lot of buffers. The ring buffer forces the
query that is actually dirtying all these buffers to also do the i/o
to write them out. Otherwise you leave them behind to slow down other
queries. This was one of the problems with the old vacuum code which
the ring buffer replaced. It left behind lots of dirtied buffers for
other queries to do i/o on.

-- 
greg



[HACKERS] our buffer replacement strategy is kind of lame

2011-08-11 Thread Robert Haas
While I was poking around at the index-only scans patch, I couldn't
help noticing that our buffer replacement algorithm leaves something
to be desired.  Here's the query again:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

As sample_data is not indexed, the plan sequentially scans that table and does
an index-only scan of pgbench_accounts for each aid.  When this is
done, here's what pg_buffercache has to say:

rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
(6 rows)

Only 96 of the 14286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around.  That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless.  The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all).  On
the other hand, the buffer manager has *no problem at all* trashing
the buffer arena if we're faulting in pages for an index scan rather
than a sequential scan.  If you manage to get all of sample_data into
memory (running many copies of the above query in parallel gives each
one its own ring buffer, so together they eventually pull in all the
pages), and then run some query that probes an index which is
too large to fit in shared_buffers, it cheerfully blows the whole
sample_data table out without a second thought.  Had you sequentially
scanned a big table, of course, there would be some protection, but an
index scan can stomp all over everything with complete impunity.

The general problem here is that we are not very smart about handling
workloads with weak locality - i.e. the working set is larger than
shared buffers.  If the working set fits in shared_buffers, we will
keep it there, and it will be strongly protected against eviction.
But if the working set *doesn't* fit in shared buffers, then we fall
back on some not-very-smart heuristics to decide what to do: keep
pages involved in index scans, ditch those involved in sequential
scans.

It seems to me that we need a smarter algorithm.  NetBSD has adopted
Clock-Pro, and there are some discussions of it being used in Linux as
well, though I'm not sure whether that's
actually (still?) the case.  Paper is here, although I find the
description of the algorithm to be approximately clear as mud:

http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

One of the important notions which many of the algorithms in the
literature seem to hit on in one way or another is that you mustn't go
and decide that all of your buffers - or nearly all your buffers - are
hot.  That's really equivalent to saying that none of your buffers are
hot; and it's also pretty easy to see that such a scenario would be
disastrous for our current algorithm, which - if everything in
shared_buffers has usage-count 5 all the time - will have to clock
sweep a long way each time it needs to allocate a buffer.  In fact,
the more of your buffers want to be hot (and thus hard to evict), the
fewer of them should actually be allowed to acquire that status.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
