Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Tue, Aug 20, 2013 at 1:57 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure mmonc...@gmail.com wrote: I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple. Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen. An even more scaled-down version would keep the current logic exactly as is, except for replacing the buffer lock in the clock sweep with a trylock (which is IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes what you need it to? If usage_count gets decremented when it is already zero, it will wrap around to 65,535, at least on some compilers some of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always wrap around. It's signed variables which don't have such a clear behaviour.

Hm, well, even better would be to leave things as they are and try to guarantee that usage_count is updated via assignment vs. increment; that way it would be impossible to wander out of bounds. I bet changing i--; to i = (i - 1); isn't going to do much against modern compilers. But what about assignment from a volatile temporary?

volatile int v = usage_count;
if (v > 0)
    v--;
usage_count = v;

something like that. Or maybe declaring usage_count as volatile might be enough?

merlin

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure mmonc...@gmail.com wrote: I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple. Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen. An even more scaled-down version would keep the current logic exactly as is, except for replacing the buffer lock in the clock sweep with a trylock (which is IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes what you need it to? If usage_count gets decremented when it is already zero, it will wrap around to 65,535, at least on some compilers some of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always wrap around. It's signed variables which don't have such a clear behaviour.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote:

On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure mmonc...@gmail.com wrote: *) What I think is happening: I think we are again getting burned by getting de-scheduled while holding the free list lock. I've been chasing this problem for a long time now (for example, see: http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html) but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. lock free list, go into clock sweep loop
4. while holding the free list lock in the clock sweep, hit a 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spinlock lottery'
7. more and more backends pile on, the linux scheduler goes berserk, reducing the chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence? I've seen spinlock contention come from enough different places to be wary of arguments that start with "it must be happening because..."

Absolutely. My evidence is circumstantial at best -- let's call it a hunch. I also do not think we are facing pure spinlock contention, but something more complex: a combination of spinlocks, the free list lwlock, and the linux scheduler. This problem showed up going from RHEL 5 to 6, which brought a lot of scheduler changes. A lot of other things changed too, but the high sys cpu really suggests we are getting some feedback from the scheduler.

IMHO, the thing to do here is run perf record -g during one of the trouble periods. The performance impact is quite low. You could probably even set up a script that runs perf for five-minute intervals at a time and saves all of the perf.data files. When one of these spikes happens, grab the one that's relevant.

Unfortunately -- that's not on the table. Dropping shared buffers to 2GB (thanks RhodiumToad) seems to have fixed the issue, and there is zero chance I will get approval to revert that setting in order to force this to re-appear. So far, I have not been able to reproduce it in testing. By the way, this problem has popped up in other places too, and the typical remedies are applied until it goes away :(.

If you see that s_lock is where all the time is going, then you've proved it's a PostgreSQL spinlock rather than something in the kernel or a shared library. If you can further see what's calling s_lock (which should hopefully be possible with perf -g), then you've got it nailed dead to rights.

Well, I don't think it's that simple. So my plan of action is this:

1) Improvise a patch that removes one *possible* trigger for the problem, or at least makes it much less likely to occur. Also, in real-world cases where usage_count is examined N times before returning a candidate buffer, the amount of overall spinlocking from buffer allocation is reduced by approximately (N-1)/N. Even though spinlocking is cheap, it's hard to argue with that...

2) Exhaustively performance test patch #1. I think this is win-win since the SGB clock sweep loop is quite frankly relatively un-optimized. I don't see how reducing the amount of locking could hurt performance, but I've been, uh, wrong about these types of things before.

3) If a general benefit without downside is shown from #2, I'll simply advance the patch for the next CF and see how things shake out. If and when I feel like there's a decent shot at getting it accepted, I may go through the motions of setting up a patched server in production and attempting a higher shared buffers setting. But that's a long way off.

merlin
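Robert's "script that runs perf for five-minute intervals" could look something like the sketch below. The directory, interval, and window count are assumptions for illustration; the `perf record -g -a` invocation itself is standard (system-wide capture with call graphs), and each window gets its own perf.data file so the one covering a stall can be examined later with perf report:

```shell
# Hypothetical rolling perf capture: record system-wide call graphs in
# fixed-length windows, one perf.data file per window, so the file covering
# a latency spike can be inspected afterwards with "perf report -g -i FILE".
capture_windows() {
    outdir=$1 interval=$2 nwindows=$3
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$nwindows" ]; do
        # ${PERF:-perf} lets a test substitute a stub for the real perf binary
        "${PERF:-perf}" record -g -a \
            -o "$outdir/perf-$(date +%Y%m%d-%H%M%S)-$i.data" -- sleep "$interval"
        i=$((i + 1))
    done
}

# e.g. six five-minute windows (needs root or a relaxed perf_event_paranoid):
# capture_windows /tmp/perf-windows 300 6
```

Old files would need pruning on a long-running box, but even this much makes "grab the window that's relevant" practical after the fact.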
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Mon, Aug 5, 2013 at 8:49 AM, Merlin Moncure mmonc...@gmail.com wrote: *) What I think is happening: I think we are again getting burned by getting de-scheduled while holding the free list lock. I've been chasing this problem for a long time now (for example, see: http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html) but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. lock free list, go into clock sweep loop
4. while holding the free list lock in the clock sweep, hit a 'hot' buffer, spin on it
5. get de-scheduled

It seems more likely to me that it got descheduled immediately after obtaining the hot buffer header spinlock but before releasing it, rather than while still spinning for it.

6. now enter the 'hot buffer spinlock lottery'
7. more and more backends pile on, the linux scheduler goes berserk, reducing the chances of winning #6

Lots and lots of processes (which want the buffer to use, not to evict) are spinning for the lock held by the descheduled process on the buffer header, and the scheduler goes berserk. A bunch of other processes are waiting on the relation extension lock, but are doing so non-berserkly on the semaphore. Each spinning process puts itself to sleep without having consumed its slice once it hits spins_per_delay, and so the scheduler keeps rescheduling them, rather than scheduling the one that is holding the lock, which consumed its entire slice and so is on the low-priority-to-reschedule list. That is my theory, of course. I'm not sure if it leads to a different course of action than the theory that it has not yet obtained the header lock.

Any idea what spins_per_delay has converged to? There seems to be no way to obtain this info, other than firing up the debugger. I had a patch somewhere that has every process elog(LOG,...) the value which it fetches from shared memory immediately upon start-up. It is a pain to hunt down the patch, apply, compile, and restart the server every time I want this value. Maybe something like this could be put in core with a suitable condition, but I don't know what that condition would be. Or would creating a C function with a pg_spins_per_delay() wrapper function which returns this value on demand be the way to go?

Is the analysis and the patch to fix the perceived problem plausible without breaking other stuff? If so, I'm inclined to go further with this. This is not the only solution on the table for high buffer contention, but IMNSHO it should get a lot of points for being very localized. Maybe a reduced version could be tried, retaining the freelist lock but keeping the 'trylock' on the buf header.

My concern is how we can ever move this forward. If we can't recreate it on a test system, and you probably won't be allowed to push experimental patches to the production system... what's left? Also, if the kernel is introducing new scheduling bottlenecks, are we just playing whack-a-mole by trying to deal with it in PG code?

Stepping back a bit, have you considered a connection pooler to restrict the number of simultaneously active connections? It wouldn't be the first time that that has alleviated the symptoms of these high-RAM kernel bottlenecks. (If we are going to play whack-a-mole, might as well use a hammer that already exists and is well tested.)

Cheers, Jeff
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure mmonc...@gmail.com wrote: I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple. Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen. An even more scaled-down version would keep the current logic exactly as is, except for replacing the buffer lock in the clock sweep with a trylock (which is IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes what you need it to? If usage_count gets decremented when it is already zero, it will wrap around to 65,535, at least on some compilers some of the time, won't it? It seems safer just not to decrement if we can't get the lock.

Cheers, Jeff
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Mon, Aug 19, 2013 at 5:02 PM, Jeff Janes jeff.ja...@gmail.com wrote: My concern is how we can ever move this forward. If we can't recreate it on a test system, and you probably won't be allowed to push experimental patches to the production system... what's left? Also, if the kernel is introducing new scheduling bottlenecks, are we just playing whack-a-mole by trying to deal with it in PG code?

I think I can rest on the basis that there is no good reason to sit and spin on a buffer during the clock sweep -- even if I have to whittle the patch down to the '1 liner' variant that trylocks the buffer header. I also like the idea of getting rid of the freelist lock completely, but I think those ideas can be tested in isolation. Really though, the acid test is going to be exhaustive performance testing.

Stepping back a bit, have you considered a connection pooler to restrict the number of simultaneously active connections? It wouldn't be the first time that that has alleviated the symptoms of these high-RAM kernel bottlenecks. (If we are going to play whack-a-mole, might as well use a hammer that already exists and is well tested.)

Certainly. It was the very first thing I suggested upon hearing about this problem. Unfortunately, when these kinds of things happen in high-scale, high-load databases on serious hardware, improvising pgbouncer is simply not possible without first going through extensive performance and application compatibility testing.

merlin
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Mon, Aug 19, 2013 at 5:17 PM, Jeff Janes jeff.ja...@gmail.com wrote:

On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure mmonc...@gmail.com wrote: I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple. Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen. An even more scaled-down version would keep the current logic exactly as is, except for replacing the buffer lock in the clock sweep with a trylock (which is IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests from buf->usage_count == 0 to buf->usage_count <= 0 accomplishes what you need it to? If usage_count gets decremented when it is already zero, it will wrap around to 65,535, at least on some compilers some of the time, won't it? It seems safer just not to decrement if we can't get the lock.

Hurk -- well, maybe it should be changed to signed in this implementation (adjustment w/o lock). Safer maybe, but you lose a part of the optimization: not having to spam cache line locks as you constantly spinlock your way around the clock. Maybe that doesn't matter much -- I'm inclined to test it both ways and see -- plus maybe a 3rd variant that manages the freelist itself with a spinlock as well.

variant A: buffer spinlock -> trylock (1 liner!)
variant B: as above, but usage_count manipulation occurs outside of the lock
variant C: as above, but dispense with the lwlock wholly or in part (plus possibly the freelist stuff etc.)
merlin
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure mmonc...@gmail.com wrote: *) What I think is happening: I think we are again getting burned by getting de-scheduled while holding the free list lock. I've been chasing this problem for a long time now (for example, see: http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html) but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. lock free list, go into clock sweep loop
4. while holding the free list lock in the clock sweep, hit a 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spinlock lottery'
7. more and more backends pile on, the linux scheduler goes berserk, reducing the chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence? I've seen spinlock contention come from enough different places to be wary of arguments that start with "it must be happening because..."

IMHO, the thing to do here is run perf record -g during one of the trouble periods. The performance impact is quite low. You could probably even set up a script that runs perf for five-minute intervals at a time and saves all of the perf.data files. When one of these spikes happens, grab the one that's relevant.

If you see that s_lock is where all the time is going, then you've proved it's a PostgreSQL spinlock rather than something in the kernel or a shared library. If you can further see what's calling s_lock (which should hopefully be possible with perf -g), then you've got it nailed dead to rights.

...Robert
Re: [HACKERS] StrategyGetBuffer optimization, take 2
Performance testing this patch is a real bugaboo for me; the VMs I have to work with are too unstable to give useful results :-(. Need to scrounge up a donor box somewhere...

On Tue, Aug 13, 2013 at 12:26 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Merlin Moncure wrote: On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila amit(dot)kapila(at)huawei(dot)com wrote: -Original Message- From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Merlin Moncure Sent: Thursday, August 08, 2013 12:09 AM To: Andres Freund Cc: PostgreSQL-development; Jeff Janes Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund andres(at)2ndquadrant(dot)com wrote: On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote: I have some very strong evidence that the problem is coming out of the buffer allocator. Exhibit A is that vlad's presentation of the problem was on a read-only load (if not the allocator lock, then what?). Exhibit B is that lowering shared buffers to 2gb seems to have (so far, 5 days in) fixed the issue. This problem shows up on fast machines with fast storage and lots of cores. So what I think is happening is that usage_count starts creeping up faster than it gets cleared by the sweep with very large buffer settings, which in turn causes the 'problem' buffers to be analyzed for eviction more often.

Yes, one idea which was discussed previously is to not increase the usage count every time a buffer is pinned. I am also working on some optimizations in a similar area, which you can refer to here: http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch. You're attacking the 'freelist' side of buffer allocation where my stripped-down patch addresses issues with the clocksweep. I think this is a good idea but more than I wanted to get into personally. Good news is that both patches should essentially bolt on together AFAICT.

True, I also think so, as both are trying to reduce contention in the same area.

I propose we do a bit of consolidation of performance testing efforts and run tests with patch A, B, and AB in various scenarios. I have a 16-core vm (4gb ram) that I can test with, and want to start with say a 2gb database / 1gb shared_buffers high-concurrency test and see how it burns in. What do you think?

I think this can mainly benefit with large data and shared buffers (> 10G); last year I also ran a few tests with similar ideas but didn't gain much with smaller shared buffers.

Are you at a point where we can run some tests?

Not now, but I will try to run before/during the next CF.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On 8/14/13 8:30 AM, Merlin Moncure wrote: Performance testing this patch is a real bugaboo for me; the VMs I have to work with are too unstable to give useful results :-(. Need to scrounge up a donor box somewhere...

I offered a server or two to the community a while ago but I don't think anything was ever resolved. It wouldn't have a massive number of cores (only 24 IIRC), but it would have a lot of memory (definitely over 192G; maybe more). The community just needs to decide it wants it and where it gets shipped to...

-- Jim Nasby, Lead Data Architect (512) 569-9461 (primary) (512) 579-9024 (backup)
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Wed, Aug 14, 2013 at 7:00 PM, Merlin Moncure mmonc...@gmail.com wrote: Performance testing this patch is a real bugaboo for me; the VMs I have to work with are too unstable to give useful results :-(. Need to scrounge up a donor box somewhere...

While doing performance tests in this area, I always had a feeling that the OS layer (the scheduler that flushes OS buffers) had a role to play here, so I used to reboot the machine between tests; of course, taking performance data in this fashion is tedious. I think it is good to run tests on a machine with a configuration similar to what Jim Nasby mentioned in his mail below.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] StrategyGetBuffer optimization, take 2
Merlin Moncure wrote: On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila amit(dot)kapila(at)huawei(dot)com wrote: -Original Message- From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Merlin Moncure Sent: Thursday, August 08, 2013 12:09 AM To: Andres Freund Cc: PostgreSQL-development; Jeff Janes Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund andres(at)2ndquadrant(dot)com wrote: On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote: I have some very strong evidence that the problem is coming out of the buffer allocator. Exhibit A is that vlad's presentation of the problem was on a read-only load (if not the allocator lock, then what?). Exhibit B is that lowering shared buffers to 2gb seems to have (so far, 5 days in) fixed the issue. This problem shows up on fast machines with fast storage and lots of cores. So what I think is happening is that usage_count starts creeping up faster than it gets cleared by the sweep with very large buffer settings, which in turn causes the 'problem' buffers to be analyzed for eviction more often.

Yes, one idea which was discussed previously is to not increase the usage count every time a buffer is pinned. I am also working on some optimizations in a similar area, which you can refer to here: http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch. You're attacking the 'freelist' side of buffer allocation where my stripped-down patch addresses issues with the clocksweep. I think this is a good idea but more than I wanted to get into personally. Good news is that both patches should essentially bolt on together AFAICT.

True, I also think so, as both are trying to reduce contention in the same area.

I propose we do a bit of consolidation of performance testing efforts and run tests with patch A, B, and AB in various scenarios. I have a 16-core vm (4gb ram) that I can test with, and want to start with say a 2gb database / 1gb shared_buffers high-concurrency test and see how it burns in. What do you think?

I think this can mainly benefit with large data and shared buffers (> 10G); last year I also ran a few tests with similar ideas but didn't gain much with smaller shared buffers.

Are you at a point where we can run some tests?

Not now, but I will try to run before/during the next CF.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila amit.kap...@huawei.com wrote: -Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Merlin Moncure Sent: Thursday, August 08, 2013 12:09 AM To: Andres Freund Cc: PostgreSQL-development; Jeff Janes Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote: I don't think the unlocked increment of nextVictimBuffer is a good idea though. nextVictimBuffer jumping over NBuffers under concurrency seems like a recipe for disaster to me. At the very, very least it will need a good wad of comments explaining what it means and how you're allowed to use it. The current way will lead to at least bgwriter accessing a nonexistent/out-of-bounds buffer via StrategySyncStart(). Possibly it won't even save that much; it might just increase the contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just need to expand the scope a bit.

all right: well, I'll work up another version doing a full spinlock and see how things shake out in performance.

FWIW, I am not convinced this is the trigger for the problems you're seeing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the buffer allocator. Exhibit A is that vlad's presentation of the problem was on a read-only load (if not the allocator lock, then what?). Exhibit B is that lowering shared buffers to 2gb seems to have (so far, 5 days in) fixed the issue. This problem shows up on fast machines with fast storage and lots of cores. So what I think is happening is that usage_count starts creeping up faster than it gets cleared by the sweep with very large buffer settings, which in turn causes the 'problem' buffers to be analyzed for eviction more often.

Yes, one idea which was discussed previously is to not increase the usage count every time a buffer is pinned. I am also working on some optimizations in a similar area, which you can refer to here: http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

yup -- just took a quick look at your proposed patch. You're attacking the 'freelist' side of buffer allocation where my stripped-down patch addresses issues with the clocksweep. I think this is a good idea but more than I wanted to get into personally. Good news is that both patches should essentially bolt on together AFAICT.

I propose we do a bit of consolidation of performance testing efforts and run tests with patch A, B, and AB in various scenarios. I have a 16-core vm (4gb ram) that I can test with, and want to start with say a 2gb database / 1gb shared_buffers high-concurrency test and see how it burns in. What do you think? Are you at a point where we can run some tests?

merlin
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Mon, Aug 5, 2013 at 11:40 AM, Andres Freund and...@anarazel.de wrote: On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote: optimization 4: remove free list lock (via Jeff Janes). This is the other optimization: one backend will no longer be able to shut down buffer allocation.

I think splitting off the actual freelist checking into a spinlock makes quite a bit of sense. And I think a separate one should be used for the actual search for the clock sweep. I don't think the unlocked increment of nextVictimBuffer is a good idea though. nextVictimBuffer jumping over NBuffers under concurrency seems like a recipe for disaster to me. At the very, very least it will need a good wad of comments explaining what it means and how you're allowed to use it. The current way will lead to at least bgwriter accessing a nonexistent/out-of-bounds buffer via StrategySyncStart(). Possibly it won't even save that much; it might just increase the contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple. Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen. An even more scaled-down version would keep the current logic exactly as is, except for replacing the buffer lock in the clock sweep with a trylock (which is IMNSHO a no-brainer).

merlin

buffer3.patch (attachment: binary data)
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote: I don't think the unlocked increment of nextVictimBuffer is a good idea though. nextVictimBuffer jumping over NBuffers under concurrency seems like a recipe for disaster to me. At the very, very least it will need a good wad of comments explaining what it means and how you're allowed to use it. The current way will lead to at least bgwriter accessing a nonexistent/out-of-bounds buffer via StrategySyncStart(). Possibly it won't even save that much; it might just increase the contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. If you (in effect) swap all contention on allocation from an lwlock to a spinlock, it's not clear you're improving things; it would have to be proven, and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just need to expand the scope a bit.

Attached is a scaled-down version of the patch that keeps the freelist lock but still removes the spinlock during the clock sweep. This still hits the major objectives of reducing the chance of scheduling out while holding the BufFreelistLock and mitigating the worst-case impact of doing so if it does happen.

FWIW, I am not convinced this is the trigger for the problems you're seeing. It's a good idea nonetheless though.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>>> I don't think the unlocked increment of nextVictimBuffer is a good
>>> idea though. nextVictimBuffer jumping over NBuffers under concurrency
>>> seems like a recipe for disaster to me. At the very, very least it
>>> will need a good wad of comments explaining what it means and how
>>> you're allowed to use it. The current way will lead to at least
>>> bgwriter accessing a nonexistent/out-of-bounds buffer via
>>> StrategySyncStart(). Possibly it won't even save that much; it might
>>> just increase the contention on the buffer header spinlock's
>>> cacheline.
>>
>> I agree; at least then it's not unambiguously better. If you (in
>> effect) swap all contention on allocation from an lwlock to a
>> spinlock, it's not clear you're improving things; it would have to be
>> proven and I'm trying to keep things simple.
>
> I think converting it to a spinlock actually is a good idea, you just
> need to expand the scope a bit.

All right: well, I'll work up another version doing a full spinlock and
maybe see how things shake out in performance.

> FWIW, I am not convinced this is the trigger for the problems you're
> seeing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the
buffer allocator. Exhibit A is that vlad's presentation of the problem
was on a read-only load (if not the allocator lock, then what?).
Exhibit B is that lowering shared_buffers to 2GB seems to have (so far,
5 days in) fixed the issue. This problem shows up on fast machines with
fast storage and lots of cores. So what I think is happening is that,
with very large buffer settings, usage_count starts creeping up faster
than it gets cleared by the sweep, which in turn causes the 'problem'
buffers to be analyzed for eviction more often.

What is not as clear is whether the proposed optimizations will fix the
problem -- I'd have to get approval to test and confirm them in
production, which seems unlikely at this juncture; that's why I'm trying
to keep things 'win-win', so that they don't have to be accepted on that
basis.

merlin
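The usage_count saturation dynamic Merlin describes can be illustrated with a toy model (my construction, not measurements from this thread): if every buffer gets pinned several times between sweep visits, the whole pool sits at the usage cap and each allocation has to decrement the entire pool several times over before finding a victim.

```c
/* Toy model of usage_count saturation.  Not PostgreSQL code; just the
 * pin/sweep update rules in isolation. */
#include <assert.h>

#define NBUFS     1000
#define MAX_USAGE 5            /* PostgreSQL's BM_MAX_USAGE_COUNT */

static int usage[NBUFS];       /* zero-initialized */

static void pin(int i)
{
    if (usage[i] < MAX_USAGE)
        usage[i]++;            /* bumped on every pin, capped */
}

/* Run the clock sweep until a victim is found; return how many buffers
 * it had to touch along the way. */
static int sweep_cost(void)
{
    static int hand = 0;
    int touched = 0;

    for (;;)
    {
        touched++;
        if (usage[hand] == 0)
        {
            hand = (hand + 1) % NBUFS;
            return touched;    /* found a victim */
        }
        usage[hand]--;
        hand = (hand + 1) % NBUFS;
    }
}
```

If every buffer is pinned ten times before the next allocation, all counts saturate at 5 and the first sweep must touch 5 * NBUFS + 1 buffers before it finds a victim -- all of it, in the current code, while holding BufFreelistLock.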
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On Wed, Aug 7, 2013 at 10:37 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>>> I don't think the unlocked increment of nextVictimBuffer is a good
>>> idea though. nextVictimBuffer jumping over NBuffers under concurrency
>>> seems like a recipe for disaster to me. At the very, very least it
>>> will need a good wad of comments explaining what it means and how
>>> you're allowed to use it. The current way will lead to at least
>>> bgwriter accessing a nonexistent/out-of-bounds buffer via
>>> StrategySyncStart(). Possibly it won't even save that much; it might
>>> just increase the contention on the buffer header spinlock's
>>> cacheline.
>>
>> I agree; at least then it's not unambiguously better. If you (in
>> effect) swap all contention on allocation from an lwlock to a
>> spinlock, it's not clear you're improving things; it would have to be
>> proven and I'm trying to keep things simple.
>
> I think converting it to a spinlock actually is a good idea, you just
> need to expand the scope a bit.

Umm, sorry if I am being naive, but don't spinlocks perform badly when a
lot of contention is present? I feel that we may hit that case here. A
preliminary check before the actual spinlock may be good, though, since
spinlocks are cheap as long as contention remains low.

--
Regards,
Atri
l'apprenant
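The "preliminary check before the actual spinlock" Atri suggests is the classic test-and-test-and-set pattern. A minimal C11 sketch for illustration (PostgreSQL's real spinlock lives in s_lock.h and looks quite different):

```c
/* Test-and-test-and-set: spin on a plain read first, and only attempt
 * the atomic write once the lock looks free. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_bool locked;
} tatas_lock;

static void tatas_acquire(tatas_lock *l)
{
    for (;;)
    {
        /* Preliminary check: read-only spinning keeps the cache line
         * in shared state instead of bouncing it between waiters. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;                   /* spin */

        /* Lock looks free: now try the expensive atomic write. */
        if (!atomic_exchange_explicit(&l->locked, true,
                                      memory_order_acquire))
            return;             /* we got it */
    }
}

static void tatas_release(tatas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Under contention the waiters mostly execute the cheap read loop, and the cache line only ping-pongs on actual hand-offs.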
Re: [HACKERS] StrategyGetBuffer optimization, take 2
-----Original Message-----
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Merlin Moncure
Sent: Thursday, August 08, 2013 12:09 AM
To: Andres Freund
Cc: PostgreSQL-development; Jeff Janes
Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2

> On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund and...@2ndquadrant.com wrote:
>> On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>>>> I don't think the unlocked increment of nextVictimBuffer is a good
>>>> idea though. nextVictimBuffer jumping over NBuffers under
>>>> concurrency seems like a recipe for disaster to me. At the very,
>>>> very least it will need a good wad of comments explaining what it
>>>> means and how you're allowed to use it. The current way will lead to
>>>> at least bgwriter accessing a nonexistent/out-of-bounds buffer via
>>>> StrategySyncStart(). Possibly it won't even save that much; it might
>>>> just increase the contention on the buffer header spinlock's
>>>> cacheline.
>>>
>>> I agree; at least then it's not unambiguously better. If you (in
>>> effect) swap all contention on allocation from an lwlock to a
>>> spinlock, it's not clear you're improving things; it would have to be
>>> proven and I'm trying to keep things simple.
>>
>> I think converting it to a spinlock actually is a good idea, you just
>> need to expand the scope a bit.
>
> All right: well, I'll work up another version doing a full spinlock and
> maybe see how things shake out in performance.
>
>> FWIW, I am not convinced this is the trigger for the problems you're
>> seeing. It's a good idea nonetheless though.
>
> I have some very strong evidence that the problem is coming out of the
> buffer allocator. Exhibit A is that vlad's presentation of the problem
> was on a read-only load (if not the allocator lock, then what?).
> Exhibit B is that lowering shared_buffers to 2GB seems to have (so far,
> 5 days in) fixed the issue. This problem shows up on fast machines with
> fast storage and lots of cores.
>
> So what I think is happening is that, with very large buffer settings,
> usage_count starts creeping up faster than it gets cleared by the
> sweep, which in turn causes the 'problem' buffers to be analyzed for
> eviction more often.

Yes; one idea which was discussed previously is to not increase the
usage count every time a buffer is pinned. I am also working on some
optimizations in a similar area, which you can refer to here:
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

> What is not as clear is whether the proposed optimizations will fix the
> problem -- I'd have to get approval to test and confirm them in
> production, which seems unlikely at this juncture; that's why I'm
> trying to keep things 'win-win', so that they don't have to be accepted
> on that basis.

With Regards,
Amit Kapila.
[HACKERS] StrategyGetBuffer optimization, take 2
My $company recently acquired another postgres-based $company and
migrated all their server operations into our datacenter. Upon
completing the move, the newly migrated database server started
experiencing huge load spikes.

*) Environment description:
Postgres 9.2.4
RHEL 6
32 cores
virtualized (ESX) but with a dedicated host
256GB RAM
shared_buffers: 32GB
96 application servers configured to max 5 connections each
very fast i/o
database size: ~200GB
HS/SR: 3 slaves

*) Problem description:
The server normally hums along nicely with load 1.0 and no iowait -- in
fact the server is massively over-provisioned. However, on a semi-random
basis (once every 1-2 days) load absolutely goes through the roof to
600+, no iowait, 90-100% (70%+ sys) cpu. It hangs around like that for
5-20 minutes, then resolves as suddenly as it started. There is nothing
interesting going on application side (except that the application
servers are all piling on), but pg_locks is recording lots of contention
on relation 'extension locks'. One interesting point is that the slaves
are also affected, but the precise point of the high load happens some
seconds after the master.

*) Initial steps taken:
RhodiumToad (aka Andrew G) has seen this in the wild several times and
suggested that dropping shared_buffers significantly might resolve the
situation short term. That was done on Friday night, and so far the
problem has not re-occurred.

*) What I think is happening:
I think we are again getting burned by being de-scheduled while holding
the free list lock. I've been chasing this problem for a long time now
(for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case. What is happening is this:

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
3. lock the free list, go into the clock sweep loop
4. while sweeping (free list lock still held), hit a 'hot' buffer, spin
   on its header lock
5. get de-scheduled
6. now enter the 'hot buffer spinlock lottery'
7. more and more backends pile on; the linux scheduler goes berserk,
   reducing the chances of winning #6
8. finally win the lottery. lock released. everything back to normal.

*) What I would like to do to fix it:
See attached patch. This builds on the work of Jeff Janes to remove the
free list lock and adds some extra optimizations in the clock sweep
loop:

optimization 1: usage count is advisory; it is not updated behind the
buffer lock. In the event there is a large sequence of buffers with 0
usage_count, this avoids spamming the cache-line lock; you decrement and
hope for the best.

optimization 2: refcount is examined during buffer allocation without a
lock. If it's > 0, the buffer is assumed pinned (even though it may not
in fact be) and the sweep continues.

optimization 3: the sweep does not wait on the buf header lock.
Instead, it does a 'try lock' and bails if the buffer is determined
pinned. I believe this to be one of the two critical optimizations.

optimization 4: remove the free list lock (via Jeff Janes). This is the
other critical optimization: one backend will no longer be able to shut
down buffer allocation.

*) What I'm asking for:
Are the analysis, and the patch to fix the perceived problem, plausible
without breaking other stuff? If so, I'm inclined to go further with
this. This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized. Maybe a reduced version could be tried, retaining the
freelist lock but keeping the 'trylock' on the buf header.
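The sweep-loop optimizations above can be sketched as a single-threaded toy. The types and lock helpers here are simplified stand-ins I made up for illustration; they are not PostgreSQL's bufmgr structures or the attached patch:

```c
/* Toy sketch of clock-sweep optimizations 1-3: unlocked prechecks,
 * advisory usage_count decrement, and trylock on the buffer header. */
#include <stdbool.h>

#define NBUFFERS 16

typedef struct
{
    int  refcount;        /* > 0 means the buffer is pinned */
    int  usage_count;     /* clock-sweep popularity counter */
    bool locked;          /* stand-in for the buffer header spinlock */
} BufferDesc;

static BufferDesc buffers[NBUFFERS];   /* zero-initialized */
static int next_victim = 0;

static bool buf_trylock(BufferDesc *b)
{
    if (b->locked)
        return false;     /* someone holds it: caller moves on */
    b->locked = true;
    return true;
}

static void buf_unlock(BufferDesc *b)
{
    b->locked = false;
}

/* Returns the index of a victim buffer (header lock held), or -1. */
static int clock_sweep(void)
{
    for (int tried = 0; tried < 2 * NBUFFERS; tried++)
    {
        int idx = next_victim;
        BufferDesc *buf = &buffers[idx];

        next_victim = (next_victim + 1) % NBUFFERS;

        /* opt 2: unlocked precheck; a stale read only costs the
         * buffer one extra trip around the clock. */
        if (buf->refcount > 0)
            continue;

        /* opt 1: advisory usage_count decrement, no header lock. */
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
            continue;
        }

        /* opt 3: trylock; never spin on a contended header. */
        if (!buf_trylock(buf))
            continue;

        /* Re-check under the lock before claiming the buffer. */
        if (buf->refcount == 0 && buf->usage_count == 0)
            return idx;
        buf_unlock(buf);
    }
    return -1;            /* two full passes found nothing */
}
```

The key property is that the only step that can block another backend is the trylock itself, and it never waits: a de-scheduled holder of a hot buffer's header lock costs the sweep one skipped buffer rather than a spin.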
*) Further reading:
http://www.postgresql.org/message-id/cahyxu0x47d4n6edpynyadshxqqxkohelv2cbrgr_2ngrc8k...@mail.gmail.com
http://postgresql.1045698.n5.nabble.com/Page-replacement-algorithm-in-buffer-cache-td5749236.html

merlin

Attachment: buffer2.patch (binary data)
Re: [HACKERS] StrategyGetBuffer optimization, take 2
On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:
> optimization 4: remove the free list lock (via Jeff Janes). This is the
> other critical optimization: one backend will no longer be able to shut
> down buffer allocation.

I think splitting off the actual freelist checking into a spinlock makes
quite a bit of sense. And I think a separate one should be used for the
actual search for the clock sweep.

I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistent/out-of-bounds buffer via StrategySyncStart(). Possibly it
won't even save that much; it might just increase the contention on the
buffer header spinlock's cacheline.

Greetings,

Andres Freund
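One way to defuse the out-of-range hazard Andres describes is to never store a wrapped value at all: make the shared counter a monotonically increasing atomic and have each reader map it into [0, NBuffers) itself. A C11 sketch of that alternative design (my construction, not the patch under discussion):

```c
/* Lock-free clock hand: the shared counter only ever grows, so a
 * "raw" value at or past NBuffers is harmless by construction.  A
 * 64-bit counter takes centuries to overflow at any realistic rate. */
#include <stdatomic.h>
#include <stdint.h>

#define NBUFFERS 131072          /* example: 1GB of 8kB buffers */

static atomic_uint_fast64_t next_victim_raw;

static uint64_t clock_sweep_tick(void)
{
    /* fetch_add hands every backend a unique tick with no lock. */
    uint64_t raw = atomic_fetch_add(&next_victim_raw, 1);
    return raw % NBUFFERS;       /* wrap on read, not on write */
}
```

A reader like StrategySyncStart() would apply the same modulo when reporting the sweep position to the bgwriter, so nothing ever sees an out-of-bounds buffer id.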
Re: [HACKERS] StrategyGetBuffer optimization, take 2
> optimization 2: refcount is examined during buffer allocation without
> a lock. If it's > 0, the buffer is assumed pinned (even though it may
> not in fact be) and the sweep continues.

+1. I think this should not lead to many problems, since a lost update
cannot, IMO, lead to a disastrous result. At most, a buffer page can
survive for an extra clock sweep.

> optimization 3: the sweep does not wait on the buf header lock.
> Instead, it does a 'try lock' and bails if the buffer is determined
> pinned. I believe this to be one of the two critical optimizations.

+1 again. I believe the trylock will act as a sort of filter before the
actual check, hence reducing contention.

--
Regards,
Atri
l'apprenant