Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-22 Thread Merlin Moncure
On Tue, Aug 20, 2013 at 1:57 AM, Andres Freund <and...@2ndquadrant.com> wrote:
 On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:
 On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

  I agree; at least then it's not unambiguously better. if you (in
  effect) swap all contention on allocation from a lwlock to a spinlock
  it's not clear if you're improving things; it would have to be proven
  and I'm trying to keep things simple.
 
  Attached is a scaled down version of the patch that keeps the freelist
  lock but still removes the spinlock during the clock sweep.  This
  still hits the major objectives of reducing the chance of scheduling
  out while holding the BufFreelistLock and mitigating the worst case
  impact of doing so if it does happen.  An even more scaled down
  version would keep the current logic exactly as is except for
  replacing buffer lock in the clock sweep with a trylock (which is
  IMNSHO a no-brainer).

 Since usage_count is unsigned, are you sure that changing the tests
 from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes
 what you need it to?  If usage_count gets decremented when it is
 already zero, it will wrap around to 65,535, at least on some
 compilers some of the time, won't it?

 Overflow of *unsigned* variables is actually defined and will always
 wrap around. It's signed variables which don't have such a clear
 behaviour.


Hm, well, even better would be to leave things as they are and try to
guarantee that usage_count is updated via assignment vs increment;
that way it would be impossible to wander out of bounds.  I bet
changing:
i--; to i=(i-1);

isn't going to do much against modern compilers.  But what about
assignment from a volatile temporary?

volatile int v = usage_count;
if (v > 0) v--;
usage_count = v;

something like that.  Or maybe declaring usage_count as volatile might
be enough?
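
To spell that out, here is a minimal self-contained sketch of the
clamped decrement -- plain C with a uint16_t and a made-up struct name
standing in for usage_count and the buffer header, not the actual
bufmgr code:

#include <stdint.h>

typedef struct
{
    volatile uint16_t usage_count;   /* stand-in for the BufferDesc field */
} FakeBufferDesc;

/*
 * Decrement without holding the buffer header lock.  Reading into a
 * local and assigning back clamps at zero, so a racing decrement can
 * lose an update but can never wrap the unsigned counter to 65535.
 */
static void
advisory_decrement(FakeBufferDesc *buf)
{
    uint16_t    v = buf->usage_count;   /* single read into a local */

    if (v > 0)
        buf->usage_count = v - 1;       /* assignment, not in-place decrement */
}

The worst case under concurrency is a lost decrement, which only costs
the buffer an extra trip around the clock.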

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-20 Thread Andres Freund
On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:
 On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
 
  I agree; at least then it's not unambiguously better. if you (in
  effect) swap all contention on allocation from a lwlock to a spinlock
  it's not clear if you're improving things; it would have to be proven
  and I'm trying to keep things simple.
 
  Attached is a scaled down version of the patch that keeps the freelist
  lock but still removes the spinlock during the clock sweep.  This
  still hits the major objectives of reducing the chance of scheduling
  out while holding the BufFreelistLock and mitigating the worst case
  impact of doing so if it does happen.  An even more scaled down
  version would keep the current logic exactly as is except for
  replacing buffer lock in the clock sweep with a trylock (which is
  IMNSHO a no-brainer).
 
 Since usage_count is unsigned, are you sure that changing the tests
 from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes
 what you need it to?  If usage_count gets decremented when it is
 already zero, it will wrap around to 65,535, at least on some
 compilers some of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always
wrap around. It's signed variables which don't have such a clear
behaviour.
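
For instance (standard C, nothing PostgreSQL-specific):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t u = 0;

    u--;                            /* defined: wraps modulo 2^16 */
    printf("%u\n", (unsigned) u);   /* prints 65535 on every conforming compiler */
    return 0;
}

Decrementing a signed integer past its minimum, by contrast, is
undefined behaviour.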

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-19 Thread Merlin Moncure
On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
 On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
 *) What I think is happening:
 I think we are again getting burned by getting de-scheduled while
 holding the free list lock. I've been chasing this problem for a long
 time now (for example, see:
 http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
 but now I've got a reproducible case.  What is happening is this:

 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
 3. Lock free list, go into clock sweep loop
 4. while holding clock sweep, hit 'hot' buffer, spin on it
 5. get de-scheduled
 6. now enter the 'hot buffer spin lock lottery'
 7. more/more backends pile on, linux scheduler goes berserk, reducing
 chances of winning #6
 8. finally win the lottery. lock released. everything back to normal.

 This is an interesting theory, but where's the evidence?  I've seen
 spinlock contention come from enough different places to be wary of
 arguments that start with "it must be happening because..."

Absolutely.  My evidence is circumstantial at best -- let's call it a
hunch.  I also do not think we are facing pure spinlock contention,
but something more complex which is a combination of spinlocks, the
free list lwlock, and the linux scheduler.  This problem showed up
going from RHEL 5 to 6, which brought a lot of scheduler changes.  A lot
of other things changed too, but the high sys cpu really suggests we
are getting some feedback from the scheduler.

 IMHO, the thing to do here is run perf record -g during one of the
 trouble periods.  The performance impact is quite low.  You could
 probably even set up a script that runs perf for five minute intervals
 at a time and saves all of the perf.data files.  When one of these
 spikes happens, grab the one that's relevant.

Unfortunately -- that's not on the table.  Dropping shared buffers to
2GB (thanks RhodiumToad) seems to have fixed the issue and there is
zero chance I will get approval to revert that setting in order to
force this to re-appear.  So far, I have not been able to reproduce in
testing.   By the way, this problem has popped up in other places too;
and the typical remedies are applied until it goes away :(.

 If you see that s_lock is where all the time is going, then you've
 proved it's a PostgreSQL spinlock rather than something in the kernel
 or a shared library.  If you can further see what's calling s_lock
 (which should hopefully be possible with perf -g), then you've got it
 nailed dead to rights.

Well, I don't think it's that simple.  So my plan of action is this:
1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur.  Also, in
real-world cases where usage_count is examined N times before
returning a candidate buffer, only the final candidate's header needs
the real lock, so the amount of overall spinlocking from buffer
allocation is reduced by approximately (N-1)/N.  Even though spin
locking is cheap, it's hard to argue with that...

2) Exhaustively performance test patch #1.  I think this is win-win
since the SGB clock sweep loop is, quite frankly, relatively
un-optimized.  I don't see how reducing the amount of locking could
hurt performance, but I've been, uh, wrong about these types of things
before.

3) If a general benefit without downside is shown from #2, I'll simply
advance the patch for the next CF and see how things shake out. If and
when I feel like there's a decent shot at getting it accepted, I may go
through the motions of setting up a patched server in production and
attempting a larger shared_buffers setting again.  But that's a long way off.

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-19 Thread Jeff Janes
On Mon, Aug 5, 2013 at 8:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

 *) What I think is happening:
 I think we are again getting burned by getting de-scheduled while
 holding the free list lock. I've been chasing this problem for a long
 time now (for example, see:
 http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
 but now I've got a reproducible case.  What is happening is this:

 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
 3. Lock free list, go into clock sweep loop
 4. while holding clock sweep, hit 'hot' buffer, spin on it
 5. get de-scheduled

It seems more likely to me that it got descheduled immediately after
obtaining the hot buffer header spinlock but before releasing it,
rather than while still spinning for it.


 6. now enter the 'hot buffer spin lock lottery'
 7. more/more backends pile on, linux scheduler goes berserk, reducing
 chances of winning #6

Lots and lots of processes (which want the buffer to use, not to
evict) are spinning for the lock held by the descheduled process on
the buffer header, and the scheduler goes berserk.  A bunch of other
processes are waiting on the relation extension lock, but are doing so
non-berserkly on the semaphore.

Each spinning process puts itself to sleep without having consumed its
slice once it hits spins_per_delay, and so the scheduler keeps
rescheduling them, rather than scheduling the one that is holding the
lock, which consumed its entire slice and so is on the low-priority
reschedule list.

That is my theory, of course.  I'm not sure if it leads to a different
course of action than the theory that it has not yet obtained the
header lock.

Any idea what spins_per_delay has converged to?  There seems to be no
way to obtain this info, other than firing up the debugger.  I had a
patch somewhere that has every process elog(LOG,...) the value which
it fetches from shared memory immediately upon start-up.  It is a pain
to hunt down the patch, apply, and compile and restart the server
every time I want this value.  Maybe something like this could be put
in core with a suitable condition, but I don't know what that
condition would be.  Or would creating a C function with a
pg_spins_per_delay() wrapper function which returns this value on
demand be the way to go?
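
A minimal sketch of what that wrapper could look like (untested; note
that spins_per_delay is currently a static variable in s_lock.c, so
this assumes a trivial getter is added there first):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* hypothetical getter -- would have to be added to s_lock.c, since
 * spins_per_delay is static to that file today */
extern int get_spins_per_delay(void);

PG_FUNCTION_INFO_V1(pg_spins_per_delay);

Datum
pg_spins_per_delay(PG_FUNCTION_ARGS)
{
    PG_RETURN_INT32(get_spins_per_delay());
}

with the SQL half being something like:

CREATE FUNCTION pg_spins_per_delay() RETURNS int4
AS 'MODULE_PATHNAME' LANGUAGE C STRICT;

That would report the calling backend's current local value, which
starts from what was saved in shared memory and adapts as the backend
runs.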



 Are the analysis and the patch to fix the perceived problem plausible
 without breaking other stuff?  If so, I'm inclined to go further with
 this.   This is not the only solution on the table for high buffer
 contention, but IMNSHO it should get a lot of points for being very
 localized.  Maybe a reduced version could be tried, retaining the
 freelist lock but keeping the 'trylock' on the buf header.

My concern is how we can ever move this forward.  If we can't recreate
it on a test system, and you probably won't be allowed to push
experimental patches to the production system...what's left?

Also, if the kernel is introducing new scheduling bottlenecks, are we
just playing whack-a-mole by trying to deal with it in PG code?

Stepping back a bit, have you considered a connection pooler to
restrict the number of simultaneously active connections?  It wouldn't
be the first time that that has alleviated the symptoms of these
high-RAM kernel bottlenecks.  (If we are going to play whack-a-mole,
might as well use a hammer that already exists and is well tested)


Cheers,

Jeff




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-19 Thread Jeff Janes
On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

 I agree; at least then it's not unambiguously better. if you (in
 effect) swap all contention on allocation from a lwlock to a spinlock
 it's not clear if you're improving things; it would have to be proven
 and I'm trying to keep things simple.

 Attached is a scaled down version of the patch that keeps the freelist
 lock but still removes the spinlock during the clock sweep.  This
 still hits the major objectives of reducing the chance of scheduling
 out while holding the BufFreelistLock and mitigating the worst case
 impact of doing so if it does happen.  An even more scaled down
 version would keep the current logic exactly as is except for
 replacing buffer lock in the clock sweep with a trylock (which is
 IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes
what you need it to?  If usage_count gets decremented when it is
already zero, it will wrap around to 65,535, at least on some
compilers some of the time, won't it?

It seems safer just not to decrement if we can't get the lock.

Cheers,

Jeff




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-19 Thread Merlin Moncure
On Mon, Aug 19, 2013 at 5:02 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:

 My concern is how we can ever move this forward.  If we can't recreate
 it on a test system, and you probably won't be allowed to push
 experimental patches to the production system...what's left?

 Also, if the kernel is introducing new scheduling bottlenecks, are we
 just playing whack-a-mole by trying to deal with it in PG code?

I think I can rest on the basis that there is no good reason to sit
and spin on a buffer during clock sweep -- even if I have to whittle
the patch down to the '1 liner' variant that trylocks the buffer
header.  I also like the idea of getting rid of the freelist lock
completely but I think those ideas can be tested in isolation.  Really
though, the acid test is going to be exhaustive performance testing.

 Stepping back a bit, have you considered a connection pooler to
 restrict the number of simultaneously active connections?  It wouldn't
 be the first time that that has alleviated the symptoms of these
 high-RAM kernel bottlenecks.  (If we are going to play whack-a-mole,
 might as well use a hammer that already exists and is well tested)

Certainly. It was the very first thing I suggested upon hearing about
this problem.  Unfortunately, when these kinds of things happen in
high-scale, high-load databases on serious hardware, improvising
pgbouncer is simply not possible without first going through extensive
performance and application-compatibility testing.

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-19 Thread Merlin Moncure
On Mon, Aug 19, 2013 at 5:17 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
 On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

 I agree; at least then it's not unambiguously better. if you (in
 effect) swap all contention on allocation from a lwlock to a spinlock
 it's not clear if you're improving things; it would have to be proven
 and I'm trying to keep things simple.

 Attached is a scaled down version of the patch that keeps the freelist
 lock but still removes the spinlock during the clock sweep.  This
 still hits the major objectives of reducing the chance of scheduling
 out while holding the BufFreelistLock and mitigating the worst case
 impact of doing so if it does happen.  An even more scaled down
 version would keep the current logic exactly as is except for
 replacing buffer lock in the clock sweep with a trylock (which is
 IMNSHO a no-brainer).

 Since usage_count is unsigned, are you sure that changing the tests
 from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes
 what you need it to?  If usage_count gets decremented when it is
 already zero, it will wrap around to 65,535, at least on some
 compilers some of the time, won't it?

 It seems safer just not to decrement if we can't get the lock.

Hurk -- well, maybe it should be changed to signed in this
implementation (adjustment w/o lock).

Safer maybe, but you lose a part of the optimization: not having to
spam cache line locks as you constantly spinlock your way around the
clock.  Maybe that doesn't matter much -- I'm inclined to test it both
ways and see -- plus maybe a 3rd variant that manages the freelist
itself with a spinlock as well.

variant A: buffer spinlock - trylock (1 liner!)
variant B: as above, but usage_count manipulation occurs outside of lock
variant C: as above, but dispense with lwlock wholly or in part (plus
possibly stuff the freelist etc).

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-17 Thread Robert Haas
On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
 *) What I think is happening:
 I think we are again getting burned by getting de-scheduled while
 holding the free list lock. I've been chasing this problem for a long
 time now (for example, see:
 http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
 but now I've got a reproducible case.  What is happening is this:

 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
 3. Lock free list, go into clock sweep loop
 4. while holding clock sweep, hit 'hot' buffer, spin on it
 5. get de-scheduled
 6. now enter the 'hot buffer spin lock lottery'
 7. more/more backends pile on, linux scheduler goes berserk, reducing
 chances of winning #6
 8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence?  I've seen
spinlock contention come from enough different places to be wary of
arguments that start with "it must be happening because..."

IMHO, the thing to do here is run perf record -g during one of the
trouble periods.  The performance impact is quite low.  You could
probably even set up a script that runs perf for five minute intervals
at a time and saves all of the perf.data files.  When one of these
spikes happens, grab the one that's relevant.

If you see that s_lock is where all the time is going, then you've
proved it's a PostgreSQL spinlock rather than something in the kernel
or a shared library.  If you can further see what's calling s_lock
(which should hopefully be possible with perf -g), then you've got it
nailed dead to rights.

...Robert




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-14 Thread Merlin Moncure
Performance testing this patch is a real bugaboo for me; the VMs I have to
work with are too unstable to give useful results :-(.  Need to scrounge up
a donor box somewhere...


On Tue, Aug 13, 2013 at 12:26 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:

 Merlin Moncure wrote:
 On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila
 amit(dot)kapila(at)huawei(dot)com wrote:
  -Original Message-
  From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
  owner(at)postgresql(dot)org] On Behalf Of Merlin Moncure
  Sent: Thursday, August 08, 2013 12:09 AM
  To: Andres Freund
  Cc: PostgreSQL-development; Jeff Janes
  Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2
 
  On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund andres(at)2ndquadrant(dot)com wrote:
   On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

  I have some very strong evidence that the problem is coming out of the
  buffer allocator.  Exhibit A is that vlad's presentation of the
  problem was on a read only load (if not allocator lock, then what?).
  Exhibit B is that lowering shared buffers to 2gb seems to have (so
  far, 5 days in) fixed the issue.  This problem shows up on fast
  machines with fast storage and lots of cores.  So what I think is
  happening is that usage_count starts creeping up faster than it gets
  cleared by the sweep with very large buffer settings which in turn
  causes the 'problem' buffers to be analyzed for eviction more often.
 
Yes, one idea which was discussed previously is to not increase the usage
  count every time a buffer is pinned.
I am also working on some optimizations in a similar area, which you
  can refer to here:

  http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

  yup -- just took a quick look at your proposed patch.  You're
  attacking the 'freelist' side of buffer allocation, whereas my
  stripped-down patch addresses issues with the clocksweep.  I think
  this is a good idea but more than I wanted to get into personally.

  Good news is that both patches should essentially bolt on together
  AFAICT.

 True, I also think so as both are trying to reduce contention in the same area.

   I propose we do a bit of consolidation of performance testing
  efforts and run tests with patch A, B, and AB in various scenarios.  I
  have a 16-core VM (4GB RAM) that I can test with and want to start
  with, say, a 2GB database / 1GB shared_buffers high-concurrency test
  and see how it burns in.  What do you think?

 I think this can mainly benefit with large data and shared buffers
 (> 10G); last year I also ran a few tests with similar ideas but
 didn't see much gain with smaller shared buffers.

   Are you at a point where we can
  run some tests?

 Not now, but I will try to run before/during next CF.


 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-14 Thread Jim Nasby

On 8/14/13 8:30 AM, Merlin Moncure wrote:

Performance testing this patch is a real bugaboo for me; the VMs I have to work
with are too unstable to give useful results :-(.  Need to scrounge up a donor
box somewhere...


I offered a server or two to the community a while ago but I don't think 
anything was ever resolved. It wouldn't have a massive number of cores (only 24 
IIRC), but it would have a lot of memory (definitely over 192G; maybe more).

The community just needs to decide it wants it and where it gets shipped to...
--
Jim Nasby, Lead Data Architect
(512) 569-9461 (primary) (512) 579-9024 (backup)




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-14 Thread Amit Kapila
On Wed, Aug 14, 2013 at 7:00 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
 Performance testing this patch is a real bugaboo for me; the VMs I have to
 work with are too unstable to give useful results :-(.  Need to scrounge up
 a donor box somewhere...

   While doing performance tests in this area, I always had a feeling
that the OS layer (the scheduler that flushes OS buffers) had a role to
play here, so I used to reboot the machine between tests; of course,
taking performance data in this fashion is tedious.
   I think it is good to run tests on a machine with a configuration
similar to what Jim Nasby mentioned in his mail below.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-12 Thread Amit Kapila
Merlin Moncure wrote:
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila amit(dot)kapila(at)huawei(dot)com wrote:
 -Original Message-
 From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
 owner(at)postgresql(dot)org] On Behalf Of Merlin Moncure
 Sent: Thursday, August 08, 2013 12:09 AM
 To: Andres Freund
 Cc: PostgreSQL-development; Jeff Janes
 Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

 On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund andres(at)2ndquadrant(dot)com wrote:
  On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

 I have some very strong evidence that the problem is coming out of the
 buffer allocator.  Exhibit A is that vlad's presentation of the
 problem was on a read only load (if not allocator lock, then what?).
 Exhibit B is that lowering shared buffers to 2gb seems to have (so
 far, 5 days in) fixed the issue.  This problem shows up on fast
 machines with fast storage and lots of cores.  So what I think is
 happening is that usage_count starts creeping up faster than it gets
 cleared by the sweep with very large buffer settings which in turn
 causes the 'problem' buffers to be analyzed for eviction more often.

   Yes, one idea which was discussed previously is to not increase the usage
 count every time a buffer is pinned.
   I am also working on some optimizations in a similar area, which you
 can refer to here:

 http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

 yup -- just took a quick look at your proposed patch.  You're
 attacking the 'freelist' side of buffer allocation, whereas my
 stripped-down patch addresses issues with the clocksweep.  I think
 this is a good idea but more than I wanted to get into personally.

 Good news is that both patches should essentially bolt on together
 AFAICT.

True, I also think so as both are trying to reduce contention in the same area.

  I propose we do a bit of consolidation of performance testing
 efforts and run tests with patch A, B, and AB in various scenarios.  I
 have a 16-core VM (4GB RAM) that I can test with and want to start
 with, say, a 2GB database / 1GB shared_buffers high-concurrency test
 and see how it burns in.  What do you think?

I think this can mainly benefit with large data and shared buffers
(> 10G); last year I also ran a few tests with similar ideas but
didn't see much gain with smaller shared buffers.

  Are you at a point where we can
 run some tests?

Not now, but I will try to run before/during next CF.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-08 Thread Merlin Moncure
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila <amit.kap...@huawei.com> wrote:
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of Merlin Moncure
 Sent: Thursday, August 08, 2013 12:09 AM
 To: Andres Freund
 Cc: PostgreSQL-development; Jeff Janes
 Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

 On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <and...@2ndquadrant.com> wrote:
  On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
   I don't think the unlocked increment of nextVictimBuffer is a good idea
   though. nextVictimBuffer jumping over NBuffers under concurrency seems
   like a recipe for disaster to me. At the very, very least it will need a
   good wad of comments explaining what it means and how you're allowed to
   use it. The current way will lead to at least bgwriter accessing a
   nonexistent/out of bounds buffer via StrategySyncStart().
   Possibly it won't even save that much, it might just increase the
   contention on the buffer header spinlock's cacheline.

  I agree; at least then it's not unambiguously better. if you (in
  effect) swap all contention on allocation from a lwlock to a spinlock
  it's not clear if you're improving things; it would have to be proven
  and I'm trying to keep things simple.

  I think converting it to a spinlock actually is a good idea, you just
  need to expand the scope a bit.

  all right: well, I'll work up another version doing a full spinlock and
  maybe see how things shake out in performance.

  FWIW, I am not convinced this is the trigger for the problems you're
  seeing. It's a good idea nonetheless though.

 I have some very strong evidence that the problem is coming out of the
 buffer allocator.  Exhibit A is that vlad's presentation of the
 problem was on a read only load (if not allocator lock, then what?).
 Exhibit B is that lowering shared buffers to 2gb seems to have (so
 far, 5 days in) fixed the issue.  This problem shows up on fast
 machines with fast storage and lots of cores.  So what I think is
 happening is that usage_count starts creeping up faster than it gets
 cleared by the sweep with very large buffer settings which in turn
 causes the 'problem' buffers to be analyzed for eviction more often.

   Yes, one idea which was discussed previously is to not increase the usage
 count every time a buffer is pinned.
   I am also working on some optimizations in a similar area, which you
 can refer to here:

 http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch.  You're
attacking the 'freelist' side of buffer allocation, whereas my
stripped-down patch addresses issues with the clocksweep.  I think
this is a good idea but more than I wanted to get into personally.

Good news is that both patches should essentially bolt on together
AFAICT.  I propose we do a bit of consolidation of performance testing
efforts and run tests with patch A, B, and AB in various scenarios.  I
have a 16-core VM (4GB RAM) that I can test with and want to start
with, say, a 2GB database / 1GB shared_buffers high-concurrency test and
see how it burns in.  What do you think?   Are you at a point where we can
run some tests?

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-07 Thread Merlin Moncure
On Mon, Aug 5, 2013 at 11:40 AM, Andres Freund <and...@anarazel.de> wrote:
 On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:
 optimization 4: remove free list lock (via Jeff Janes).  This is the
 other optimization: one backend will no longer be able to shut down
 buffer allocation

 I think splitting off the actual freelist checking into a spinlock makes
 quite a bit of sense. And I think a separate one should be used for the
 actual search for the clock sweep.


 I don't think the unlocked increment of nextVictimBuffer is a good idea
 though. nextVictimBuffer jumping over NBuffers under concurrency seems
 like a recipe for disaster to me. At the very, very least it will need a
 good wad of comments explaining what it means and how you're allowed to
 use it. The current way will lead to at least bgwriter accessing a
 nonexistent/out of bounds buffer via StrategySyncStart().
 Possibly it won't even save that much, it might just increase the
 contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep.  This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen.  An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).

merlin


buffer3.patch
Description: Binary data



Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-07 Thread Andres Freund
On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
  I don't think the unlocked increment of nextVictimBuffer is a good idea
  though. nextVictimBuffer jumping over NBuffers under concurrency seems
  like a recipe for disaster to me. At the very, very least it will need a
  good wad of comments explaining what it means and how you're allowed to
  use it. The current way will lead to at least bgwriter accessing a
  nonexistant/out of bounds buffer via StrategySyncStart().
  Possibly it won't even save that much, it might just increase the
  contention on the buffer header spinlock's cacheline.
 
 I agree; at least then it's not unambiguously better. if you (in
 effect) swap all contention on allocation from a lwlock to a spinlock
 it's not clear if you're improving things; it would have to be proven
 and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

 Attached is a scaled down version of the patch that keeps the freelist
 lock but still removes the spinlock during the clock sweep.  This
 still hits the major objectives of reducing the chance of scheduling
 out while holding the BufFreelistLock and mitigating the worst case
 impact of doing so if it does happen.

FWIW, I am not convinced this is the trigger for the problems you're
seeing. It's a good idea nonetheless though.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-07 Thread Merlin Moncure
On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <and...@2ndquadrant.com> wrote:
 On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
  I don't think the unlocked increment of nextVictimBuffer is a good idea
  though. nextVictimBuffer jumping over NBuffers under concurrency seems
  like a recipe for disaster to me. At the very, very least it will need a
  good wad of comments explaining what it means and how you're allowed to
  use it. The current way will lead to at least bgwriter accessing a
  nonexistant/out of bounds buffer via StrategySyncStart().
  Possibly it won't even save that much, it might just increase the
  contention on the buffer header spinlock's cacheline.

 I agree; at least then it's not unambiguously better. if you (in
 effect) swap all contention on allocation from a lwlock to a spinlock
 it's not clear if you're improving things; it would have to be proven
 and I'm trying to keep things simple.

 I think converting it to a spinlock actually is a good idea, you just
 need to expand the scope a bit.

all right: well, I'll work up another version doing a full spinlock and
maybe see how things shake out in performance.

 FWIW, I am not convinced this is the trigger for the problems you're
 seeing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the
buffer allocator.  Exhibit A is that vlad's presentation of the
problem was on a read only load (if not allocator lock, then what?).
Exhibit B is that lowering shared buffers to 2gb seems to have (so
far, 5 days in) fixed the issue.  This problem shows up on fast
machines with fast storage and lots of cores.  So what I think is
happening is that usage_count starts creeping up faster than it gets
cleared by the sweep with very large buffer settings which in turn
causes the 'problem' buffers to be analyzed for eviction more often.
What is not as clear is if the proposed optimizations will fix the
problem -- I'd have to get approval to test and confirm them in
production, which seems unlikely at this juncture; that's why I'm
trying to keep things 'win-win' so that they don't have to be
accepted on that basis.

merlin




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-07 Thread Atri Sharma
On Wed, Aug 7, 2013 at 10:37 PM, Andres Freund <and...@2ndquadrant.com> wrote:
 On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
  I don't think the unlocked increment of nextVictimBuffer is a good idea
  though. nextVictimBuffer jumping over NBuffers under concurrency seems
  like a recipe for disaster to me. At the very, very least it will need a
  good wad of comments explaining what it means and how you're allowed to
  use it. The current way will lead to at least bgwriter accessing a
  nonexistant/out of bounds buffer via StrategySyncStart().
  Possibly it won't even save that much, it might just increase the
  contention on the buffer header spinlock's cacheline.

 I agree; at least then it's not unambiguously better. if you (in
 effect) swap all contention on allocation from a lwlock to a spinlock
 it's not clear if you're improving things; it would have to be proven
 and I'm trying to keep things simple.

 I think converting it to a spinlock actually is a good idea, you just
 need to expand the scope a bit.

Umm, sorry if I am being naive, but don't spinlocks perform badly when
a lot of contention is present on that node?

I feel that we may hit that case here.  A preliminary check before
the actual spinlock may be good, though, since spinlocks are cheap as
long as contention remains low.
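
Something like a test-and-test-and-set loop, sketched here with C11
atomics rather than PostgreSQL's s_lock.h primitives:

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Test-and-test-and-set: spin on a plain read first, so waiting CPUs
 * mostly hit a locally cached line instead of bouncing it around with
 * atomic writes; only attempt the expensive atomic exchange once the
 * lock looks free.
 */
static void
ttas_lock(atomic_bool *lock)
{
    for (;;)
    {
        /* cheap preliminary check: no bus-locking write */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            ;                       /* spin (a real lock would pause/backoff) */

        /* lock looked free: now try the real atomic operation */
        if (!atomic_exchange_explicit(lock, true, memory_order_acquire))
            return;                 /* got it */
    }
}

static void
ttas_unlock(atomic_bool *lock)
{
    atomic_store_explicit(lock, false, memory_order_release);
}

The preliminary read keeps waiters cheap while the lock is held; the
atomic write only happens when there is a realistic chance of
acquiring the lock.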

Regards,

Atri

-- 
Regards,

Atri
l'apprenant




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-07 Thread Amit Kapila


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-
 ow...@postgresql.org] On Behalf Of Merlin Moncure
 Sent: Thursday, August 08, 2013 12:09 AM
 To: Andres Freund
 Cc: PostgreSQL-development; Jeff Janes
 Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2
 
  On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <and...@2ndquadrant.com> wrote:
  On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
   I don't think the unlocked increment of nextVictimBuffer is a good idea
   though. nextVictimBuffer jumping over NBuffers under concurrency seems
   like a recipe for disaster to me. At the very, very least it will need a
   good wad of comments explaining what it means and how you're allowed to
   use it. The current way will lead to at least bgwriter accessing a
   nonexistent/out of bounds buffer via StrategySyncStart().
   Possibly it won't even save that much, it might just increase the
   contention on the buffer header spinlock's cacheline.

  I agree; at least then it's not unambiguously better. if you (in
  effect) swap all contention on allocation from a lwlock to a spinlock
  it's not clear if you're improving things; it would have to be proven
  and I'm trying to keep things simple.

  I think converting it to a spinlock actually is a good idea, you just
  need to expand the scope a bit.
 
 all right: well, I'll work up another version doing a full spinlock and
 maybe see how things shake out in performance.
 
  FWIW, I am not convinced this is the trigger for the problems you're
 seeing. It's a good idea nonetheless though.
 
 I have some very strong evidence that the problem is coming out of the
 buffer allocator.  Exhibit A is that vlad's presentation of the
 problem was on a read only load (if not allocator lock, then what?).
 Exhibit B is that lowering shared buffers to 2gb seems to have (so
 far, 5 days in) fixed the issue.  This problem shows up on fast
 machines with fast storage and lots of cores.  So what I think is
 happening is that usage_count starts creeping up faster than it gets
 cleared by the sweep with very large buffer settings which in turn
 causes the 'problem' buffers to be analyzed for eviction more often.

  Yes, one idea which was discussed previously is to not increase the usage
count every time a buffer is pinned.
  I am also working on some optimizations in a similar area, which you
can refer to here:

http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com


 What is not as clear is if the proposed optimizations will fix the
 problem -- I'd have to get approval to test and confirm them in
 production, which seems unlikely at this juncture; that's why I'm
 trying to keep things 'win-win' so that they don't have to be
 accepted on that basis.

With Regards,
Amit Kapila.





[HACKERS] StrategyGetBuffer optimization, take 2

2013-08-05 Thread Merlin Moncure
My $company recently acquired another postgres based $company and
migrated all their server operations into our datacenter.  Upon
completing the move, the newly migrated database server started
experiencing huge load spikes.


*) Environment description:
Postgres 9.2.4
RHEL 6
32 cores
virtualized (ESX) but with a dedicated host
256GB ram
shared_buffers: 32G
96 application servers configured to max 5 connections each
very fast i/o
database size: ~ 200GB
HS/SR: 3 slaves

*) Problem description:
The server normally hums along nicely with load < 1.0 and no iowait --
in fact the server is massively over-provisioned.  However, on a
semi-random basis (once every 1-2 days) load absolutely goes through
the roof to 600+, no iowait, 90-100%  (70%+ sys) cpu.  It hangs around
like that for 5-20 minutes then resolves as suddenly as it started.
There is nothing interesting going on application side (except the
application servers are all piling on) but pg_locks is recording lots
of contention on relation 'extension locks'.   One interesting point
is that the slaves are also affected, but the onset of the high load
happens some seconds after it does on the master.

*) Initial steps taken:
RhodiumToad (aka Andrew G) has seen this in the wild several times and
suggested dropping shared_buffers significantly might resolve the
situation short term.  That was done on Friday night, and so far the
problem has not re-occurred.

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case.  What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding clock sweep, hit 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spin lock lottery'
7. more/more backends pile on, linux scheduler goes bezerk, reducing
chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

*) what I would like to do to fix it:
see attached patch.  This builds on the work of Jeff Janes to remove
the free list lock and has some extra optimizations in the clock sweep
loop:

optimization 1: usage count is advisory. it is not updated behind the
buffer lock. in the event there are large sequences of buffers with
>0 usage_count, this avoids spamming the cache line lock; you
decrement and hope for the best

optimization 2: refcount is examined during buffer allocation without
a lock.  if it's > 0, the buffer is assumed pinned (even though it may not
in fact be) and the sweep continues

optimization 3: sweep does not wait on buf header lock.  instead, it
does 'try lock' and bails if the buffer is determined pinned.  I
believe this to be one of the two critical optimizations

optimization 4: remove free list lock (via Jeff Janes).  This is the
other optimization: one backend will no longer be able to shut down
buffer allocation
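
To make the sweep-side changes concrete, here is a simplified,
self-contained sketch of one sweep step with optimizations 1-3 folded
in (toy types, and a C11 atomic_flag standing in for the buffer header
spinlock; the actual patch of course works against the real BufferDesc
and s_lock.h):

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define NBUFFERS 1024

typedef struct
{
    atomic_flag       header_lock;  /* stand-in for buf header spinlock */
    volatile uint16_t usage_count;
    volatile uint32_t refcount;
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static int next_victim = 0;         /* still protected by the freelist lock */

/* One clock sweep step; returns a victim candidate or NULL to keep going. */
static FakeBufferDesc *
sweep_step(void)
{
    FakeBufferDesc *buf = &buffers[next_victim];
    uint16_t    v;

    next_victim = (next_victim + 1) % NBUFFERS;

    /* opt 2: unlocked peek at refcount; if it looks pinned, move on */
    if (buf->refcount > 0)
        return NULL;

    /* opt 1: advisory usage_count, decremented without the header lock
     * and clamped so the unsigned counter can never wrap */
    v = buf->usage_count;
    if (v > 0)
    {
        buf->usage_count = v - 1;
        return NULL;
    }

    /* opt 3: trylock instead of spinning; a held header lock means the
     * buffer is hot, so skip it rather than enter the spin lottery */
    if (atomic_flag_test_and_set(&buf->header_lock))
        return NULL;

    /* caller re-checks refcount/usage_count under the lock before evicting */
    return buf;
}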

*) what I'm asking for

Are the analysis and the patch to fix the perceived problem plausible
without breaking other stuff?  If so, I'm inclined to go further with
this.   This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized.  Maybe a reduced version could be tried, retaining the
freelist lock but keeping the 'trylock' on the buf header.

*) further reading:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

http://www.postgresql.org/message-id/cahyxu0x47d4n6edpynyadshxqqxkohelv2cbrgr_2ngrc8k...@mail.gmail.com

http://postgresql.1045698.n5.nabble.com/Page-replacement-algorithm-in-buffer-cache-td5749236.html


merlin


buffer2.patch
Description: Binary data



Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-05 Thread Andres Freund
On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:
 optimization 4: remove free list lock (via Jeff Janes).  This is the
 other optimization: one backend will no longer be able to shut down
 buffer allocation

I think splitting off the actual freelist checking into a spinlock makes
quite a bit of sense. And I think a separate one should be used for the
actual search for the clock sweep.
I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistent/out of bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.

Greetings,

Andres Freund




Re: [HACKERS] StrategyGetBuffer optimization, take 2

2013-08-05 Thread Atri Sharma
 optimization 2: refcount is examined during buffer allocation without
 a lock.  if it's > 0, the buffer is assumed pinned (even though it may not
 in fact be) and the sweep continues

+1.

I think this shall not lead to many problems, since a lost update
cannot, IMO, lead to a disastrous result.  At most, a buffer page can
survive for an extra clock sweep.


 optimization 3: sweep does not wait on buf header lock.  instead, it
 does 'try lock' and bails if the buffer is determined pinned.  I
 believe this to be one of the two critical optimizations

+1 again. I believe the try lock will be a sort of filter before the
actual check, hence reducing the contention.


Regards,

Atri



-- 
Regards,

Atri
l'apprenant

