Re: [HACKERS] Scaling shared buffer eviction
On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have updated the patch to address the feedback. Main changes are:
1. For populating freelist, have a separate process (bgreclaimer) instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely eliminated BufFreelistLock).
4. Improved comments and general code cleanup.

Overall this looks quite promising to me. I had thought to call the new process just bgreclaim rather than bgreclaimer, but perhaps your name is better after all. At least, it matches what we do elsewhere. But I don't care for the use of Bgreclaimer; let's do BgReclaimer if we really need mixed-case, or else bgreclaimer.

This is unclear:

+buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something like this: The data structures relating to buffer eviction are protected by two spinlocks. freelist_lck protects the freelist and related data structures, while victimbuf_lck protects information related to the current clock sweep condition.

+always in this list. We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer. The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than running the clock sweep algorithm, which may encounter many buffers that are poor candidates for eviction before finding a good candidate. Therefore, we have a background process called bgreclaimer which works to keep this list populated.

+Background Reclaimer's Processing
+-

I suggest titling this section Background Reclaim.

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with bgreclaimer.

+and move the the unpinned and zero usage count buffers to freelist. It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the water mark terminology.

+if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+/* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. Shouldn't we loop and do the whole thing over once we reach the high watermark, only stopping when StrategySyncStartAndEnd() says num_to_free is 0?

+/* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? If we put them further apart, e.g. by moving the freelist_lck to the start of the structure, followed by the latches, and leaving victimbuf_lck where it is, does that help performance?

+/*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+

This comment is clearly obsolete.

I have not yet added statistics (buffers_backend_clocksweep) as for that we need to add one more variable in BufferStrategyControl structure where I have already added few variables for this patch. I think it is important to have such a stat available via pg_stat_bgwriter, but not sure if it is worth to make the structure a bit more bulky.

I think it's worth it.

Another minor point is about changes in lwlock.h:

 * if you remove a lock, consider leaving a gap in the numbering
 * sequence for the benefit of DTrace and other external debugging
 * scripts.

As I have removed BufFreelistLock, I have adjusted the numbering as well in lwlock.h. There is a message on top of the lock definitions which suggests to leave a gap if we remove any.
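To make the tmp_num_to_free question above concrete, the reclaim loop could be shaped roughly like this, re-checking the freelist against the high water mark each time around rather than working from a count fixed once at entry. This is only an illustration: StrategySyncStartAndEnd() and StrategyMoveBufferToFreeListEnd() are the patch's functions as discussed above, but the signatures and surrounding details here are my assumptions, not the actual patch code:

    /* Illustrative sketch only -- not the patch code. */
    static void
    BgReclaimerLoop(void)
    {
        int     next_victim = 0;
        int     num_to_free = 0;

        for (;;)
        {
            /* Re-check against the high water mark on every cycle. */
            StrategySyncStartAndEnd(&next_victim, &num_to_free);
            if (num_to_free <= 0)
                break;          /* freelist is back at the high water mark */

            while (num_to_free > 0)
            {
                volatile BufferDesc *bufHdr = &BufferDescriptors[next_victim];

                next_victim = (next_victim + 1) % NBuffers;

                LockBufHdr(bufHdr);
                if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
                {
                    UnlockBufHdr(bufHdr);
                    /* unpinned, zero usage count: put it on the freelist */
                    if (StrategyMoveBufferToFreeListEnd(bufHdr))
                        num_to_free--;
                }
                else
                {
                    /* like the clock sweep, age buffers we pass over */
                    if (bufHdr->usage_count > 0)
                        bufHdr->usage_count--;
                    UnlockBufHdr(bufHdr);
                }
            }
        }
    }

Whether the usage-count aging belongs here or stays with the backends' clock sweep is a separate question; it is shown only to mirror what the reclaim logic is described as doing elsewhere in the thread.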
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com wrote: I have yet to collect data under varying loads, however I have collected performance data for 8GB shared buffers which shows reasonably good performance and scalability. I think the main part left for this patch is more data for various loads which I will share in next few days, however I think patch is ready for next round of review, so I will mark it as Needs Review.

I have collected more data with the patch. I understand that you have given more review comments due to which the patch requires changes, however I think that will not affect the performance data to a great extent and I have anyway taken the data, so sharing the same.

Performance Data:
---
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5 mins
All the data is in tps and taken using pgbench read-only load.
Common configuration remains same as above.

Shared_Buffers = 500MB
Client Count/Patch_Ver      8      16      32      64     128
HEAD                    56248  100112  121341   81128   56552
Patch                   59389  112483  157034  185740  166725

Shared_Buffers = 1GB
Client Count/Patch_Ver      8      16      32      64     128
HEAD                    56401  102557  121643   81686   57091
Patch                   59361  114813  157528  188209  167752

Shared_Buffers = 14GB
Client Count/Patch_Ver      8      16      32      64     128
HEAD                    60059  110582  152051  130718   97014
Patch                   61462  117377  169767  225803  229083

Shared_Buffers = 15GB
Client Count/Patch_Ver      8      16      32      64     128
HEAD                    60005  112928  153650  135203   36343
Patch                   61345  115569  168767  226150   36985

Observations
-
1. Performance improvement is up to 2~3 times for higher client counts (64, 128).
2. For lower client count (8), we can see 2~5% performance improvement.
3. Overall, this improves the read scalability.
4. For lower number of shared buffers, we see that there is a minor dip in tps even after patch (it might be that we can improve it by tuning the high water mark for the number of buffers on freelist; I will try this by varying the high water mark).
5. For larger shared buffers (15GB), we can see that there is still a dip at large client count, although the situation is not as bad as on HEAD. The reason is that at such high shared buffers and client count, I/O starts happening because all the data can't be contained in RAM.

I will try to take some data for tpc-b load as well. Kindly let me know if you want to see data for some other configuration.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com wrote: I have yet to collect data under varying loads, however I have collected performance data for 8GB shared buffers which shows reasonably good performance and scalability. I think the main part left for this patch is more data for various loads which I will share in next few days, however I think patch is ready for next round of review, so I will mark it as Needs Review. I have collected more data with the patch. I understand that you have given more review comments due to which patch require changes, however I think it will not effect the performance data to a great extent and I have anyway taken the data, so sharing the same. Performance Data: --- Configuration and Db Details IBM POWER-7 16 cores, 64 hardware threads RAM = 64GB Database Locale =C checkpoint_segments=256 checkpoint_timeout=15min scale factor = 3000 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) Duration of each individual run = 5mins All the data is in tps and taken using pgbench read-only load Common configuration remains same as above. Forgot to mention that data is a median of 3 runs and attached sheet contains data for individual runs. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com perf_read_scalability_data_v5.ods Description: application/vnd.oasis.opendocument.spreadsheet -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything: the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else.

Another point is I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock. A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here but it's a restriction that bears thinking about (and documenting). But quite aside from safety, the fields are almost certainly going to be in the same cache line which means contention between processes that are trying to fetch or store them concurrently. For a patch whose sole excuse for existence is to improve performance, that should be a very scary concern. (And yes, I realize these issues already affect the freelist. Perhaps that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mind while looking at Amit's work, but the performance numbers he's posted upthread are stellar. Maybe we can squeeze some additional performance out of this by padding out the cache lines, but it's probably minor compared to the gains he's already seeing. I think we should focus on trying to lock in those gains, and then we can consider what further things we may want to do after that. If it turns out that structure-padding is among those things, that's easy enough to do as a separate patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
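If structure padding does turn out to be worthwhile later, the change itself is mechanical. A rough illustration follows; this is not the committed layout of BufferStrategyControl, the field set merely follows the discussion in this thread, and the pad size assumes 64-byte cache lines:

    /* Illustrative only -- see freelist.c for the real definition. */
    typedef struct
    {
        /* Hot for every allocation: protects the freelist and its counters. */
        slock_t     freelist_lck;
        int         firstFreeBuffer;    /* head of free-buffer list */
        int         lastFreeBuffer;     /* tail of free-buffer list */
        uint32      numBufferAllocs;    /* buffers allocated since last reset */
        Latch      *bgwriterLatch;      /* set when the freelist needs refilling */

        /* Padding so the clock-sweep spinlock sits on its own cache line. */
        char        pad[64];

        /* Clock sweep state, touched only when the freelist is empty. */
        slock_t     victimbuf_lck;
        int         nextVictimBuffer;
        uint32      completePasses;     /* complete cycles of the clock sweep */
    } BufferStrategyControl;

The point is simply that backends popping buffers off the freelist and the process advancing the clock hand would no longer ping-pong the same cache line.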
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila amit.kapil...@gmail.com wrote: Today, while working on updating the patch to improve locking I found that as now we are going to have a new process, we need a separate latch in StrategyControl to wakeup that process. Another point is I think it will be better to protect StrategyControl-completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it. Sounds reasonable. I think the key thing at this point is to get a new version of the patch with the background reclaim running in a different process than the background writer. I don't see much point in fine-tuning the locking regimen until that's done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: Incidentally, while I generally think your changes to the locking regimen in StrategyGetBuffer() are going in the right direction, they need significant cleanup. Your patch adds two new spinlocks, freelist_lck and victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and you've now got StrategyGetBuffer() running with no lock at all when accessing some things that used to be protected by BufFreelistLock; specifically, you're doing StrategyControl->numBufferAllocs++ and SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not OK. I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything: the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else. Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save the value of StrategyControl->bgwriterLatch; pop a buffer off the freelist if there is one, saving its identity. Release the spinlock. Then, set the bgwriterLatch if needed. In the first loop, first check whether the buffer we previously popped from the freelist is pinned or has a non-zero usage count and return it if not, holding the buffer header lock. Otherwise, reacquire the spinlock just long enough to pop a new potential victim and then loop around.

Today, while working on updating the patch to improve locking I found that as now we are going to have a new process, we need a separate latch in StrategyControl to wake up that process. Another point is I think it will be better to protect StrategyControl->completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it.

I thought it is better to mention the above points so that if you have any different thoughts about them, it is better to discuss them now rather than after I take performance data with this locking protocol.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
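Restating the suggested StrategyGetBuffer() flow in code form, a condensed sketch looks like this. It is not the actual patch: the strategy-ring path, the retry loop, and error handling are elided, and the clock-sweep fallback is only hinted at.

    static volatile BufferDesc *
    StrategyGetBufferSketch(void)
    {
        volatile BufferDesc *buf = NULL;
        Latch      *bgwriterLatch;

        SpinLockAcquire(&StrategyControl->freelist_lck);
        StrategyControl->numBufferAllocs++;
        bgwriterLatch = StrategyControl->bgwriterLatch;
        StrategyControl->bgwriterLatch = NULL;
        if (StrategyControl->firstFreeBuffer >= 0)
        {
            /* pop the head of the freelist while holding freelist_lck */
            buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;
        }
        SpinLockRelease(&StrategyControl->freelist_lck);

        /* The latch is set only after the spinlock has been released. */
        if (bgwriterLatch)
            SetLatch(bgwriterLatch);

        if (buf != NULL)
        {
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;     /* caller gets it with the header lock held */
            UnlockBufHdr(buf);
        }

        /*
         * Freelist was empty or the popped buffer was unusable: the real
         * code falls back to running the clock sweep here, taking
         * victimbuf_lck only to advance the hand.
         */
        return NULL;
    }

The interesting property is that the allocation counter, the latch pointer, and the freelist pop all happen inside one short freelist_lck critical section, with SetLatch() deferred until after the release.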
Re: [HACKERS] Scaling shared buffer eviction
Amit Kapila amit.kapil...@gmail.com writes: On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: I think you should get rid of BufFreelistLock completely and just decide that freelist_lck will protect everything the freeNext links, plus everything in StrategyControl except for nextVictimBuffer. victimbuf_lck will protect nextVictimBuffer and nothing else. Another point is I think it will be better to protect StrategyControl-completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it. I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock. A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here but it's a restriction that bears thinking about (and documenting). But quite aside from safety, the fields are almost certainly going to be in the same cache line which means contention between processes that are trying to fetch or store them concurrently. For a patch whose sole excuse for existence is to improve performance, that should be a very scary concern. (And yes, I realize these issues already affect the freelist. Perhaps that's part of the reason we have performance issues with it.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: Another point is I think it will be better to protect StrategyControl-completePasses with victimbuf_lck rather than freelist_lck, as when we are going to update it we will already be holding the victimbuf_lck and it doesn't make much sense to release the victimbuf_lck and reacquire freelist_lck to update it. I'm rather concerned by this cavalier assumption that we can protect fields a,b,c with one lock and fields x,y,z in the same struct with some other lock. In some cases, it could be beneficial especially when a,b,c are going to be more frequently accessed as compare to x,y,z. A minimum requirement for that to work safely at all is that the fields are of atomically fetchable/storable widths; which might be okay here but it's a restriction that bears thinking about (and documenting). But quite aside from safety, the fields are almost certainly going to be in the same cache line which means contention between processes that are trying to fetch or store them concurrently. I think patch will reduce the contention for some of such variables (which are accessed during clock sweep) as it will minimize the need to perform clock sweep by backends. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote: Overall, the main changes required in patch as per above feedback are: 1. add an additional counter for the number of those allocations not satisfied from the free list, with a name like buffers_alloc_clocksweep. 2. Autotune the low and high threshold values for buffers in freelist. In the patch, I have kept them as hard-coded values. 3. For populating freelist, have a separate process (bgreclaim) instead of doing it by bgwriter. I'm not convinced that 3) is the right way to go to be honest. Seems like a huge bandaid to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-13 09:51:58 +0530, Amit Kapila wrote: Overall, the main changes required in patch as per above feedback are: 1. add an additional counter for the number of those allocations not satisfied from the free list, with a name like buffers_alloc_clocksweep. 2. Autotune the low and high threshold values for buffers in freelist. In the patch, I have kept them as hard-coded values. 3. For populating freelist, have a separate process (bgreclaim) instead of doing it by bgwriter. I'm not convinced that 3) is the right way to go to be honest. Seems like a huge bandaid to me. Doing both (populating freelist and flushing dirty buffers) via bgwriter isn't the best way either because it might not be able to perform both the jobs as per need. One example is it could take much longer time to flush a dirty buffer than to move it into free list, so if there are few buffers which we need to flush, then I think task of maintaining buffers in freelist will get hit even though there are buffers in list which can be moved to free list(non-dirty buffers). Another is maintaining the current behaviour of bgwriter which is to scan the entire buffer pool every few mins (assuming default configuration). We can attempt to solve this problem as suggested by Robert upthread but I am not completely sure if that can guarantee that the current behaviour will be retained as it is. I am not telling that having a separate process won't have any issues, but I think we can tackle them without changing or complicating current bgwriter behaviour. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-08-06 15:42:08 +0530, Amit Kapila wrote: On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com wrote: This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email almost unreadable because it misses quoting signs after linebreaks in quoted content.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-08-06 15:42:08 +0530, Amit Kapila wrote: On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers. FWIW, I found this email amost unreadable because it misses quoting signs after linebreaks in quoted content. I think I have done something wrong while replying to Robert's mail, the main point in that mail was trying to see if there is any major problem incase we have separate process (bgreclaim) to populate freelist. One thing which I thought could be problematic is to put a buf in freelist which has usage_count as zero and is *dirty*. Please do let me know if you want clarification for something in particular. Overall, the main changes required in patch as per above feedback are: 1. add an additional counter for the number of those allocations not satisfied from the free list, with a name like buffers_alloc_clocksweep. 2. Autotune the low and high threshold values for buffers in freelist. In the patch, I have kept them as hard-coded values. 3. For populating freelist, have a separate process (bgreclaim) instead of doing it by bgwriter. There are other things also which I need to take care as per feedback like some change in locking strategy and code. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com wrote: This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers. I have kept it just for the reason that if the basic approach is sounds reasonable/accepted, then I will clean it up. Sorry for the inconvenience, I didn't realized that it can be annoying for reviewer, I will remove all such code from patch in next version. A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes. Okay, I will take care of this based on the conclusion of the other points in this mail. This is important because it guarantees that dirty data doesn't sit in memory forever. When the system becomes busy again after a long idle period, users will expect the system to have used the idle time to flush dirty buffers to disk. This also improves data recovery prospects if, for example, somebody loses their pg_xlog directory - there may be dirty buffers whose contents are lost, of course, but they won't be months old. b. New stats for number of buffers on freelist has been added, some old one's like maxwritten_clean can be removed as new logic for syncing buffers and moving them to free list doesn't use them. However I think it's better to remove them once the new logic is accepted. Added some new logs for info related to free list under BGW_DEBUG If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense. I think using 'number of buffers currently on the freelist' and 'number of recently allocated buffers' for consecutive cycles, we can figure out approximately how many buffer allocations needs clock sweep assuming low and high threshold water marks are fixed. However there can be cases where it is not easy to estimate that number. I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads. This can give us clear idea to tune the patch, however we need to maintain 3 counters for it in code(recent_alloc (needed for current bgwriter logic) and other 2 suggested by you). 
Do you want to retain such counters in code or it's for kind of debug info for patch? d. Autotune the low and high threshold for freelist for various configurations. I think we need to come up with some kind of formula here rather than just a list of hard-coded constants. That was my initial intention as well and I have tried based on number of shared buffers like keeping threshold values as percentage of shared buffers but nothing could satisfy different kind of workloads. The current values I have choosen are based on experiments for various workloads at different thresholds. I have shown the lwlock_stats data for various loads based on current thresholds upthread. Another way could be to make them as config knobs and use the values as given by user incase it is provided by user else go with fixed values. There are other instances in code as well (one of them I remember offhand is in pglz_compress) where we use fixed values based on different sizes. And it definitely needs some comments explaining the logic behind the choices. Agreed, I shall improve them in next version of patch. Aside from those specific remarks, I think the elephant in the room is the question of whether it really makes sense to have one process which is responsible both for populating the free list and for writing buffers to disk. One problem, which I alluded to above under point (1), is
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila amit.kapil...@gmail.com wrote: If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense. I think using 'number of buffers currently on the freelist' and 'number of recently allocated buffers' for consecutive cycles, we can figure out approximately how many buffer allocations needs clock sweep assuming low and high threshold water marks are fixed. However there can be cases where it is not easy to estimate that number. Counters should be design in such a way that you can read it, and then read it again later, and make sense of it - you should not need to read the counter on *consecutive* cycles to interpret it. I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads. This can give us clear idea to tune the patch, however we need to maintain 3 counters for it in code(recent_alloc (needed for current bgwriter logic) and other 2 suggested by you). Do you want to retain such counters in code or it's for kind of debug info for patch? I only mean to propose one new counter, and I'd imagine including that in the final patch. We already have a counter of total buffer allocations; that's buffers_alloc. I'm proposing to add an additional counter for the number of those allocations not satisfied from the free list, with a name like buffers_alloc_clocksweep (I said buffers_backend_clocksweep above, but that's probably not best, as the existing buffers_backend counts buffer *writes*, not allocations). I think we would definitely want to retain this counter in the final patch, as an additional column in pg_stat_bgwriter. d. Autotune the low and high threshold for freelist for various configurations. I think we need to come up with some kind of formula here rather than just a list of hard-coded constants. That was my initial intention as well and I have tried based on number of shared buffers like keeping threshold values as percentage of shared buffers but nothing could satisfy different kind of workloads. The current values I have choosen are based on experiments for various workloads at different thresholds. I have shown the lwlock_stats data for various loads based on current thresholds upthread. Another way could be to make them as config knobs and use the values as given by user incase it is provided by user else go with fixed values. How did you go about determining the optimal value for a particular workload? When the list is kept short, it's less likely that a value on the list will be referenced or dirtied again before the page is actually recycled. That's clearly good. But when the list is long, it's less likely to become completely empty and thereby force individual backends to run the clock-sweep. 
My suspicion is that, when the number of buffers is small, the impact of the list being too short isn't likely to be very significant, because running the clock-sweep isn't all that expensive anyway - even if you have to scan through the entire buffer pool multiple times, there aren't that many buffers. But when the number of buffers is large, those repeated scans can cause a major performance hit, so having an adequate pool of free buffers becomes much more important. I think your list of high-watermarks is far too generous for low buffer counts. With more than 100k shared buffers, you've got a high-watermark of 2k buffers, which means that 2% or less of the buffers will be on the freelist, which seems a little on the high side to me, but probably in the ballpark of what we should be aiming for. But at 10001 shared buffers, you can have 1000 of them on the freelist, which is 10% of the buffer pool; that seems high. At 101 shared buffers, 75% of the buffers in the system can be on the freelist; that seems ridiculous. The chances of a buffer still being unused by the time it reaches the head of the freelist seem very small. Based on your existing list of thresholds, and taking the above into account, I'd suggest something like this: let the high-watermark for the freelist be 0.5% of the total number of buffers, with a maximum of 2000 and a minimum of 5. Let the low-watermark be 20% of the high-watermark. That might not be best, but I think some kind of formula like that can likely be made to work. I would
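Spelled out, the suggested water-mark computation would be something like the following. This is purely illustrative; the constants are the ones from the paragraph above and nothing here is patch code:

    static void
    ComputeFreelistWaterMarks(int nbuffers, int *high_water, int *low_water)
    {
        int     high = nbuffers / 200;      /* 0.5% of shared buffers */

        if (high > 2000)
            high = 2000;
        if (high < 5)
            high = 5;

        *high_water = high;
        *low_water = high / 5;              /* 20% of the high water mark */
    }

For example, with 16384 buffers (128MB of shared buffers) this gives a high water mark of 81 and a low water mark of 16, which is in the spirit of the numbers discussed above.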
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have improved the patch by making following changes: a. Improved the bgwriter logic to log for xl_running_xacts info and removed the hibernate logic as bgwriter will now work only when there is scarcity of buffer's in free list. Basic idea is when the number of buffers on freelist drops below the low threshold, the allocating backend sets the latch and bgwriter wakesup and begin adding buffer's to freelist until it reaches high threshold and then again goes back to sleep. This essentially removes BgWriterDelay, but it's still mentioned in BgBufferSync(). Looking further, I see that with the patch applied, BgBufferSync() is still present in the source code but is no longer called from anywhere. Please don't submit patches that render things unused without actually removing them; it makes it much harder to see what you've changed. I realize you probably left it that way for testing purposes, but you need to clean such things up before submitting. Likewise, if you've rendered GUCs or statistics counters removed, you need to rip them out, so that the scope of the changes you've made is clear to reviewers. A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least one behavior that some people (at least, me) will care about, which is the guarantee that the background writer will scan the entire buffer pool at least every couple of minutes. This is important because it guarantees that dirty data doesn't sit in memory forever. When the system becomes busy again after a long idle period, users will expect the system to have used the idle time to flush dirty buffers to disk. This also improves data recovery prospects if, for example, somebody loses their pg_xlog directory - there may be dirty buffers whose contents are lost, of course, but they won't be months old. b. New stats for number of buffers on freelist has been added, some old one's like maxwritten_clean can be removed as new logic for syncing buffers and moving them to free list doesn't use them. However I think it's better to remove them once the new logic is accepted. Added some new logs for info related to free list under BGW_DEBUG If I'm reading this right, the new statistic is an incrementing counter where, every time you update it, you add the number of buffers currently on the freelist. That makes no sense. I think what you should be counting is the number of allocations that are being satisfied from the free-list. Then, by comparing the rate at which that value is incrementing to the rate at which buffers_alloc is incrementing, somebody can figure out what percentage of allocations are requiring a clock-sweep run. Actually, I think it's better to flip it around: count the number of allocations that require an individual backend to run the clock sweep (vs. being satisfied from the free-list); call it, say, buffers_backend_clocksweep. We can then try to tune the patch to make that number as small as possible under varying workloads. c. Used the already existing bgwriterLatch in BufferStrategyControl to wake bgwriter when number of buffer's in freelist drops below threshold. Seems like a good idea. d. Autotune the low and high threshold for freelist for various configurations. Generally if keep small number (200~2000) of buffers always available on freelist, then even for high shared buffers like 15GB, it appears to be sufficient. 
However when the value of shared buffers is less, then we need a much smaller number. I think we can provide these as config knobs for the user as well, but for now based on LWLOCK_STATS results, I have chosen some hard coded values for low and high threshold values for freelist. Values for low and high threshold have been decided based on the total number of shared buffers; basically I have divided them into 5 categories (16~100, 100~1000, 1000~10000, 10000~100000, 100000 and above) and then ran tests (read-only pgbench) for various configurations falling under these categories. The reason for keeping fewer categories for larger shared buffers is that if there is a small number (200~2000) of buffers available on the free list, then it seems to be sufficient for quite high loads; however as the total number of shared buffers decreases we need to be more careful, as if we keep the number too low then it will lead to more clock sweep by backends (which means freelist lock contention) and if we keep the number higher bgwriter will evict many useful buffers. Results based on LWLOCK_STATS are at the end of the mail.

I think we need to come up with some kind of formula here rather than just a list of hard-coded constants. And it definitely needs some comments explaining the logic behind the choices.
Re: [HACKERS] Scaling shared buffer eviction
On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner kgri...@ymail.com wrote: Backend processes related to user connections still performed about 30% of the writes, and this work shows promise toward bringing that down, which would be great; but please don't eliminate the ability to prevent write stalls in the process. I am planning to take some more performance data, part of which will be write load as well, but I am not sure if that can anyway show the need as mentioned by you.

After taking the performance data for write load using tpc-b with the patch, I found that there is a regression in it. So I went ahead and tried to figure out the reason for the same, and found that after the patch, Bgwriter started flushing buffers which were required by backends; the reason was that *nextVictimBuffer* was not getting updated properly while we are running clock sweep kind of logic (decrement the usage count when the number of buffers on freelist falls below the low threshold value) in Bgwriter. In HEAD, I noticed that at default settings, BGwriter was not at all flushing any buffers, which is at least better than what my patch was doing (flushing buffers required by backends). So I tried to fix the issue by updating *nextVictimBuffer* in the new BGWriter logic and the results are positive.

sbe - scalable buffer eviction

Select only Data
Client count/TPS      64      128
Un-patched         45232    17310
sbe_v3            111468   114521
sbe_v4            153137   160752

TPC-B
Client count/TPS      64      128
Un-patched           825      784
sbe_v4               814      845

For Select Data, I am quite confident that it will improve if we introduce nextVictimBuffer increments in BGwriter, and rather it scales much better with that change; however for TPC-B, I am getting fluctuation in the data, so not sure it has eliminated the problem. The main difference is that in HEAD, BGwriter never increments nextVictimBuffer during syncing the buffers; it just notes down the current setting before start and then proceeds sequentially.

I think it will be good if we can have a new process for moving buffers to free list due to below reasons:
a. while trying to move buffers to freelist, it should not block due to in between write activity.
b. The writer should not increment nextVictimBuffer and maintain the current logic.

One significant change in this version of patch is to use a separate spin lock to protect nextVictimBuffer rather than using BufFreelistLock. Suggestions?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com scalable_buffer_eviction_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
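For what it's worth, the "separate spinlock to protect nextVictimBuffer" change boils down to something like the sketch below. This is illustrative only; the lock and field names follow this thread's discussion, and whether completePasses is bumped here or under freelist_lck is exactly the point debated elsewhere in the thread:

    /* Sketch: advance the clock hand under a dedicated spinlock. */
    static int
    ClockSweepTickSketch(void)
    {
        int     victim;

        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->victimbuf_lck);

        return victim;
    }

Each caller then inspects the returned buffer's pin and usage count under the buffer header lock, exactly as the existing clock sweep does, without holding any strategy-wide lock while doing so.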
Re: [HACKERS] Scaling shared buffer eviction
On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner kgri...@ymail.com wrote: Amit Kapila amit.kapil...@gmail.com wrote: I have improved the patch by making following changes: a. Improved the bgwriter logic to log for xl_running_xacts info and removed the hibernate logic as bgwriter will now work only when there is scarcity of buffer's in free list. Basic idea is when the number of buffers on freelist drops below the low threshold, the allocating backend sets the latch and bgwriter wakesup and begin adding buffer's to freelist until it reaches high threshold and then again goes back to sleep. The numbers from your benchmarks are very exciting, but the above concerns me. My tuning of the bgwriter in production has generally *not* been aimed at keeping pages on the freelist, Just to be clear, prior to this patch, the bgwriter has never been in the business of putting pages on the freelist in the first place, so it wouldn't have been possible for you to tune for that. Essentially I was able to tune the bgwriter so that a dirty page was always push out to the OS cache within three seconds, which led to a healthy balance of writes between the checkpoint process and the bgwriter. Backend processes related to user connections still performed about 30% of the writes, and this work shows promise toward bringing that down, which would be great; but please don't eliminate the ability to prevent write stalls in the process. I think, as Amit says downthread, that the crucial design question here is whether we need two processes, one to populate the freelist so that regular backends don't need to run the clock sweep, and a second to flush dirty buffers, or whether a single process can serve both needs. In favor of a single process, many people have commented that the background writer doesn't seem to do much right now. If the process is mostly sitting around idle, then giving it more responsibilities might be OK. In favor of having a second process, I'm a little concerned that if the background writer gets busy writing a page, it might then be unavailable to populate the freelist until it finishes, which might be a very long time relative to the buffer allocation needs of other backends. I'm not sure what the right answer is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
Amit Kapila amit.kapil...@gmail.com wrote: I have improved the patch by making following changes: a. Improved the bgwriter logic to log for xl_running_xacts info and removed the hibernate logic as bgwriter will now work only when there is scarcity of buffer's in free list. Basic idea is when the number of buffers on freelist drops below the low threshold, the allocating backend sets the latch and bgwriter wakesup and begin adding buffer's to freelist until it reaches high threshold and then again goes back to sleep. The numbers from your benchmarks are very exciting, but the above concerns me. My tuning of the bgwriter in production has generally *not* been aimed at keeping pages on the freelist, but toward preventing shared_buffers from accumulating a lot of dirty pages, which were leading to cascades of writes between caches and thus to write stalls. By pushing dirty pages into the (*much* larger) OS cache, and letting write combining happen there, where the OS could pace based on the total number of dirty pages instead of having some hidden and appearing rather suddenly, latency spikes were avoided while not causing any noticeable increase in the number of OS writes to the RAID controller's cache. Essentially I was able to tune the bgwriter so that a dirty page was always push out to the OS cache within three seconds, which led to a healthy balance of writes between the checkpoint process and the bgwriter. Backend processes related to user connections still performed about 30% of the writes, and this work shows promise toward bringing that down, which would be great; but please don't eliminate the ability to prevent write stalls in the process. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner kgri...@ymail.com wrote: Amit Kapila amit.kapil...@gmail.com wrote: I have improved the patch by making following changes: a. Improved the bgwriter logic to log for xl_running_xacts info and removed the hibernate logic as bgwriter will now work only when there is scarcity of buffer's in free list. Basic idea is when the number of buffers on freelist drops below the low threshold, the allocating backend sets the latch and bgwriter wakesup and begin adding buffer's to freelist until it reaches high threshold and then again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above concerns me. My tuning of the bgwriter in production has generally *not* been aimed at keeping pages on the freelist, but toward preventing shared_buffers from accumulating a lot of dirty pages, which were leading to cascades of writes between caches and thus to write stalls. By pushing dirty pages into the (*much* larger) OS cache, and letting write combining happen there, where the OS could pace based on the total number of dirty pages instead of having some hidden and appearing rather suddenly, latency spikes were avoided while not causing any noticeable increase in the number of OS writes to the RAID controller's cache. Essentially I was able to tune the bgwriter so that a dirty page was always push out to the OS cache within three seconds, which led to a healthy balance of writes between the checkpoint process and the bgwriter.

I think it would have been better if bgwriter did its writes based on the amount of buffers that get dirtied, to achieve the balance of writes.

Backend processes related to user connections still performed about 30% of the writes, and this work shows promise toward bringing that down, which would be great; but please don't eliminate the ability to prevent write stalls in the process.

I agree that for some cases as explained by you, the current bgwriter logic does satisfy the need; however there are other cases as well where it actually doesn't help much. One of such cases I am trying to improve (ease backend buffer allocations), and another may be when there is constant write activity, for which I am not sure how much it really helps. Part of the reason for trying to make bgwriter respond mainly to ease backend allocations is the previous discussion for the same, refer below link: http://www.postgresql.org/message-id/ca+tgmoz7dvhc4h-ffjmzcff6vwynfoeapz021vxw61uh46r...@mail.gmail.com

However if we want to retain the current property of bgwriter, we can do the same in one of the below ways:
a. Have separate processes for writing dirty buffers and moving buffers to freelist.
b. In the current bgwriter, separate the two works based on the need. The need can be decided based on whether bgwriter has been waked due to shortage of buffers on free list or if it has been waked due to BgWriterDelay (a rough sketch of this follows below).
Now as populating the freelist and balancing writes by writing dirty buffers are two separate responsibilities, I am not sure if doing both in one process is a good idea.

I am planning to take some more performance data, part of which will be write load as well, but I am not sure if that can anyway show the need as mentioned by you.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
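To illustrate option (b) referenced above: the decision could be made at the top of the existing bgwriter loop, roughly as below. This is only a sketch of the idea, not patch code; FreelistBelowLowWaterMark() and MoveUnpinnedBuffersToFreelist() are hypothetical helpers standing in for the freelist check and the reclaim work, while the latch handling mirrors the existing bgwriter.c loop:

    static void
    BgWriterSketchLoop(void)
    {
        for (;;)
        {
            int     rc;

            ResetLatch(&MyProc->procLatch);

            if (FreelistBelowLowWaterMark())        /* hypothetical helper */
                MoveUnpinnedBuffersToFreelist();    /* refill to the high water mark */
            else
                BgBufferSync();                     /* existing duty: write dirty buffers */

            /* Woken either by a backend (freelist shortage) or by the timeout. */
            rc = WaitLatch(&MyProc->procLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           BgWriterDelay /* milliseconds */ );
            if (rc & WL_POSTMASTER_DEATH)
                proc_exit(1);
        }
    }

The obvious weakness, and part of the argument for a separate process, is that while BgBufferSync() is busy writing, a freelist shortage signalled via the latch cannot be serviced until the write pass finishes.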
Re: [HACKERS] Scaling shared buffer eviction
On Thu, May 15, 2014 at 11:11 AM, Amit Kapila amit.kapil...@gmail.comwrote: Data with LWLOCK_STATS -- BufMappingLocks PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101 PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72 PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54 PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60 PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128 PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81 PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503 PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71 PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46 PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30 PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14 PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25 PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30 PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28 PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25 PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22 PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29 PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36 PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22 This data shows that after patch, there is no contention for BufFreeListLock, rather there is a huge contention around BufMappingLocks. I have checked that HEAD also has contention around BufMappingLocks. As per my analysis till now, I think reducing contention around BufFreelistLock is not sufficient to improve scalability, we need to work on reducing contention around BufMappingLocks as well. To reduce the contention around BufMappingLocks, I have tried the patch by just increasing the Number of Buffer Partitions, and it actually shows a really significant increase in scalability both due to reduced contention around BufFreeListLock and BufMappingLocks. The real effect of reducing contention around BufFreeListLock was hidden because the whole contention was shifted to BufMappingLocks. I have taken performance data for both HEAD+increase_buf_part and Patch+increase_buf_part to clearly see the benefit of reducing contention around BufFreeListLock. This data has been taken using pgbench read only load (Select). Performance Data --- HEAD + 64 = HEAD + (NUM_BUFFER_PARTITONS(64) + LOG2_NUM_LOCK_PARTITIONS(6)) V1 + 64 = PATCH + (NUM_BUFFER_PARTITONS(64) + LOG2_NUM_LOCK_PARTITIONS(6)) Similarly 128 means 128 buffer partitions shared_buffers= 8GB scale factor = 3000 RAM - 64GB Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64 105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544 shared_buffers= 8GB scale factor = 1000 RAM - 64GB Thrds (64) Thrds (128) HEAD 92142 31050 HEAD + 64 108120 86367 V1 + 64 117454 123429 HEAD + 128 107762 86902 V1 + 128 123641 124822 Observations - 1. There is increase of upto 5 times in performance for data that can fit in memory but not in shared buffers 2. Though there is a increase in performance by just increasing number of buffer partitions, but it doesn't scale well (especially see the case when partitions have increased to 128 from 64). 
I have verified that contention has reduced around BufMappingLocks by running the patch with LWLOCKS BufFreeListLock PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1 BufMappingLocks PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0 PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0 PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0 PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0 .. .. PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0 PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0 PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0 PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0 We can clearly notice that the number for *blk* has reduced significantly which shows that contention has reduced. The patch is still in a shape to prove the merit of idea and I have just changed the number of partitions so that if someone wants to verify the performance for similar load, it can be done by just applying the patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com scalable_buffer_eviction_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription:
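For anyone trying to reproduce the "+ 64" / "+ 128" runs: the partition counts are compile-time constants, and each buffer tag's hash simply maps onto one of the BufMappingLock partitions. My recollection of the relevant definitions is below; check lwlock.h and buf_internals.h in your tree for the exact current form:

    /* lwlock.h -- stock values; the experiment raises them to 64/128 and 6 */
    #define NUM_BUFFER_PARTITIONS      16
    #define LOG2_NUM_LOCK_PARTITIONS   4
    #define NUM_LOCK_PARTITIONS        (1 << LOG2_NUM_LOCK_PARTITIONS)

    /* buf_internals.h -- which mapping partition a buffer tag hash falls under */
    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)

More partitions spread the buffer-mapping traffic over more LWLocks, which is why the blk counts in the LWLOCK_STATS output above drop so sharply; the cost is simply more locks in shared memory, so the right default is a separate discussion.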
Re: [HACKERS] Scaling shared buffer eviction
On Fri, May 16, 2014 at 10:51 AM, Amit Kapila amit.kapil...@gmail.com wrote: Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64 105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544

I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault. In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. I started working on that almost 2 years ago, building a hash-table that can be read without requiring any locks and written with, well, less locking than what we have right now: http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash I never got quite as far as trying to hook that up to the buffer mapping machinery, but maybe that would be worth doing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.com wrote: shared_buffers= 8GB scale factor = 3000 RAM - 64GB Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64 105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544 shared_buffers= 8GB scale factor = 1000 RAM - 64GB Thrds (64) Thrds (128) HEAD 92142 31050 HEAD + 64 108120 86367 V1 + 64 117454 123429 HEAD + 128 107762 86902 V1 + 128 123641 124822

I'm having a little trouble following this. These figures are transactions per second for a 300 second pgbench tpc-b run? What does Thrds denote? -- Peter Geoghegan
Re: [HACKERS] Scaling shared buffer eviction
On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan p...@heroku.com wrote: On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.com wrote: shared_buffers= 8GB scale factor = 3000 RAM - 64GB I'm having a little trouble following this. These figure are transactions per second for a 300 second pgbench tpc-b run? Yes, the figures are tps for a 300 second run. It is for select-only transactions. What does Thrds denote? It denotes number of threads (-j in pgbench run) I have used below statements to take data ./pgbench -c 64 -j 64 -T 300 -S postgres ./pgbench -c 128 -j 128 -T 300 -S postgres The reason for posting the numbers for 64/128 threads is because we have mainly concurrency bottleneck when the number of connections are higher than CPU cores and I am using 16 cores, 64 hardware threads m/c. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Sat, May 17, 2014 at 6:02 AM, Robert Haas robertmh...@gmail.com wrote: I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on it's own (though it does produce some) shouldn't be held against it - the fact that the contention shifts elsewhere when the first bottleneck is removed is not your patch's fault. In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. Is there anything bad except for may be increase in LWLocks with scaling hash partitions w.r.t to shared buffers either by auto tuning or by having a configuration knob. I understand that it would be bit difficult for users to estimate the correct value of such a parameter, we can provide info about its usage in docs such that if user increases shared buffers by 'X' (20 times) of default value (128M), then consider increasing such partitions and it should be always power of 2 or does something similar to above internally in code. I agree that may be even by having a reasonably good estimate of number of partitions w.r.t shared buffers, we might not be able to eliminate the contention around BufMappingLocks, but I think the scalability we get by doing that is not bad either. I started working on that almost 2 years ago, building a hash-table that can be read without requiring any locks and written with, well, less locking than what we have right now: I have still not read the complete code, but by just going through initial file header, it seems to me that it will be much better than current implementation in terms of concurrency, by the way does such an implementation can extend to reducing scalability for hash indexes as well? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Scaling shared buffer eviction
As mentioned previously about my interest in improving shared buffer eviction especially by reducing contention around BufFreelistLock, I would like to share my progress about the same. The test used for this work is mainly the case when all the data doesn't fit in shared buffers, but does fit in memory. It is mainly based on previous comparison done by Robert for similar workload: http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html To start with, I have taken LWLOCK_STATS report to confirm the contention around BufFreelistLock and the data for HEAD is as follows: M/c details IBM POWER-7 16 cores, 64 hardware threads RAM - 64GB Test scale factor = 3000 shared_buffers = 8GB number_of_threads = 64 duration = 5mins ./pgbench -c 64 -j 64 -T 300 -S postgres LWLOCK_STATS data for BufFreeListLock PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023 Here the high *blk* count for scale factor 3000 clearly shows that to find a usable buffer when data doesn't fit in shared buffers it has to wait. To solve this issue, I have implemented a patch which makes sure that there are always enough buffers on freelist such that the need for backend to run clock-sweep is minimal, the implementation idea is more or less same as discussed previously in below thread, so I will explain it at end of mail. http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kap...@huawei.com LWLOCK_STATS data after Patch (test used is same as used for HEAD): BufFreeListLock PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0 Here the low *exacq* and *blk* count shows that the need to run clock sweep for backend has reduced significantly. Performance Data --- shared_buffers= 8GB number of threads - 64 sc - scale factor sc tps Head300045569 Patch 300046457 Head100093037 Patch 100092711 Above data shows that there is no significant change in performance or scalability even after the contention is reduced significantly around BufFreelistLock. I have analyzed the patch both with perf record and LWLOCK_STATS, both indicates that there is a high contention around BufMappingLocks. Data With perf record -a -g - + 10.14% swapper [kernel.kallsyms] [k] .pseries_dedicated_idle_sleep + 7.77% postgres [kernel.kallsyms] [k] ._raw_spin_lock + 6.88% postgres [kernel.kallsyms] [k] .function_trace_call + 4.15% pgbench [kernel.kallsyms] [k] .try_to_wake_up + 3.20% swapper [kernel.kallsyms] [k] .function_trace_call + 2.99% pgbench [kernel.kallsyms] [k] .function_trace_call + 2.41% postgres postgres [.] AllocSetAlloc + 2.38% postgres [kernel.kallsyms] [k] .try_to_wake_up + 2.27% pgbench [kernel.kallsyms] [k] ._raw_spin_lock + 1.49% postgres [kernel.kallsyms] [k] ._raw_spin_lock_irq + 1.36% postgres postgres [.] AllocSetFreeIndex + 1.09% swapper [kernel.kallsyms] [k] ._raw_spin_lock + 0.91% postgres postgres [.] GetSnapshotData + 0.90% postgres postgres [.] 
MemoryContextAllocZeroAligned Expanded graph -- - 10.14% swapper [kernel.kallsyms] [k] .pseries_dedicated_idle_sleep - .pseries_dedicated_idle_sleep - 10.13% .pseries_dedicated_idle_sleep - 10.13% .cpu_idle - 10.00% .start_secondary .start_secondary_prolog - 7.77% postgres [kernel.kallsyms] [k] ._raw_spin_lock - ._raw_spin_lock - 6.63% ._raw_spin_lock - 5.95% .double_rq_lock - .load_balance - 5.95% .__schedule - .schedule - 3.27% .SyS_semtimedop .SyS_ipc syscall_exit semop PGSemaphoreLock LWLockAcquireCommon - LWLockAcquire - 3.27% BufferAlloc ReadBuffer_common - ReadBufferExtended - 3.27% ReadBuffer - 2.73% ReleaseAndReadBuffer - 1.70% _bt_relandgetbuf _bt_search _bt_first btgettuple It shows BufferAlloc-LWLOCK as top contributor and we use BufMappingLocks in BufferAlloc, I have checked other expanded calls as well, StrategyGetBuffer is not present in top contributors. Data with LWLOCK_STATS -- BufMappingLocks PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274