Re: [HACKERS] Scaling shared buffer eviction
On Tue, Nov 11, 2014 at 3:00 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-11-11 09:29:22 +0000, Thom Brown wrote:
> > On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:
> > > On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote:
> > > > But this gets at another point: the way we're benchmarking this right
> > > > now, we're really conflating the effects of three different things:
> > > > 1. Changing the locking regimen around the freelist and clocksweep.
> > > > 2. Adding a bgreclaimer process.
> > > > 3. Raising the number of buffer locking partitions.
> > > First of all, thanks for committing part-1 of these changes. It seems
> > > you are planning to commit part-3 based on the results of tests which
> > > Andres is planning to do, and for the remaining part (part-2), today
> > Were parts 2 and 3 committed in the end?
> 3 was committed. 2 wasn't because it's not yet clear how beneficial it is
> generally.

As shown upthread, this patch (as it stands today) depends on another patch
(wait-free LW_SHARED acquisition) which is still not committed, and some more
work is needed to clearly show the gain from this patch, so I have marked it
as Returned with Feedback.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:
> On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote:
> > But this gets at another point: the way we're benchmarking this right
> > now, we're really conflating the effects of three different things:
> > 1. Changing the locking regimen around the freelist and clocksweep.
> > 2. Adding a bgreclaimer process.
> > 3. Raising the number of buffer locking partitions.
> First of all, thanks for committing part-1 of these changes. It seems you
> are planning to commit part-3 based on the results of tests which Andres is
> planning to do, and for the remaining part (part-2), today

Were parts 2 and 3 committed in the end?

--
Thom
Re: [HACKERS] Scaling shared buffer eviction
On 2014-11-11 09:29:22 +0000, Thom Brown wrote:
> On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:
> > On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote:
> > > But this gets at another point: the way we're benchmarking this right
> > > now, we're really conflating the effects of three different things:
> > > 1. Changing the locking regimen around the freelist and clocksweep.
> > > 2. Adding a bgreclaimer process.
> > > 3. Raising the number of buffer locking partitions.
> > First of all, thanks for committing part-1 of these changes. It seems you
> > are planning to commit part-3 based on the results of tests which Andres
> > is planning to do, and for the remaining part (part-2), today
> Were parts 2 and 3 committed in the end?

3 was committed. 2 wasn't because it's not yet clear how beneficial it is
generally.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Oct 14, 2014 at 3:24 PM, Amit Kapila amit.kapil...@gmail.com wrote:
> On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila amit.kapil...@gmail.com wrote:
> > On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
> > > On another point, I think it would be a good idea to rebase the
> > > bgreclaimer patch over what I committed, so that we have a clean patch
> > > against master to test with.
> > Please find the rebased patch attached with this mail. I have taken some
> > performance data as well and done some analysis based on the same.
> To reduce the above contention, I tried to write a patch to replace the
> spinlock used in dynahash to manage the free list with atomic operations.
> There is still work pending on this patch with respect to ensuring that
> the approach used in the patch is completely sane; however, I am posting
> the patch so that others can have a look at it and give me feedback about
> the approach.

After further working on this patch (replacement of the spinlock in
dynahash), I found that I need to solve the A-B-A problem for the lockless
structure, as described in section 5.1 of paper [1]. I could have pursued
solving it by using some additional variables as described in section 5 of
paper [1], but it seems to me that Robert has already solved it by using
another technique in his patch [2], so it would be a waste of effort for me
to keep pursuing this problem. So I am not planning to continue with this
patch and am marking it as Rejected in the CF app.

[1] http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf
[2] https://commitfest.postgresql.org/action/patch_view?id=1613

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
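The A-B-A problem referenced above, in a minimal C11 sketch (illustrative
only; this is neither the actual dynahash patch nor the full Valois
algorithm): a naive CAS-based pop can succeed against a stale snapshot of
the free list, while packing a version counter next to the pointer, as in
section 5 of the paper, makes the stale snapshot fail the CAS.

#include <stdatomic.h>
#include <stddef.h>

typedef struct Node
{
    struct Node *next;
} Node;

/* Naive pop: between loading 'head' and the CAS, another thread can pop
 * 'head', pop 'head->next', and push 'head' back.  The CAS still succeeds
 * because the top pointer again equals 'head', but 'head->next' is stale:
 * the A-B-A problem. */
Node *
naive_pop(_Atomic(Node *) *top)
{
    Node   *head = atomic_load(top);

    /* on failure, the CAS reloads the current top into 'head' */
    while (head != NULL &&
           !atomic_compare_exchange_weak(top, &head, head->next))
        ;
    return head;
}

/* Fix: CAS the pointer together with a version counter bumped on every
 * update, so a recycled 'head' no longer compares equal.  (A 16-byte CAS
 * may need -latomic or a double-width CAS instruction.) */
typedef struct VersionedTop
{
    Node           *ptr;
    unsigned long   version;
} VersionedTop;

Node *
versioned_pop(_Atomic VersionedTop *top)
{
    VersionedTop cur = atomic_load(top);
    VersionedTop next;

    do
    {
        if (cur.ptr == NULL)
            return NULL;
        next.ptr = cur.ptr->next;
        next.version = cur.version + 1;
    } while (!atomic_compare_exchange_weak(top, &cur, next));
    return cur.ptr;
}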
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Oct 14, 2014 at 3:32 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> > After that I observed that contention for LW_SHARED has reduced for this
> > load, but it didn't help much in terms of performance, so I again
> > rechecked the profile, and this time most of the contention has moved to
> > the spinlock used in dynahash for the buf mapping tables; please refer
> > to the profile (for 128 client count; read-only load) below:
> > bgreclaimer patch + wait free lw_shared acquisition patches
> This profile is without -O2 again. I really don't think it makes much
> sense to draw much inference from an unoptimized build.

Profile data with -O2 is below. This shows that the top contributors are
calls to BufTableLookup and the spinlock taken by BufTableInsert and
BufTableDelete. To resolve the spinlock contention, a patch like the above
might prove to be useful (although I still have to evaluate it). I would
also like to collect LWLOCK_STATS data once before proceeding further. Do
you have any other ideas?

 11.17%  swapper   [unknown]          [H] 0x011e0328
+ 4.62%  postgres  postgres           [.] hash_search_with_hash_value
+ 4.35%  pgbench   [kernel.kallsyms]  [k] .find_busiest_group
+ 3.71%  postgres  postgres           [.] s_lock
  2.56%  postgres  [unknown]          [H] 0x01500120
+ 2.23%  pgbench   [kernel.kallsyms]  [k] .idle_cpu
+ 1.97%  postgres  postgres           [.] LWLockAttemptLock
+ 1.73%  postgres  postgres           [.] LWLockRelease
+ 1.47%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user_power7
+ 1.44%  postgres  postgres           [.] GetSnapshotData
+ 1.28%  postgres  postgres           [.] _bt_compare
+ 1.04%  swapper   [kernel.kallsyms]  [k] .int_sqrt
+ 1.04%  postgres  postgres           [.] AllocSetAlloc
+ 0.97%  pgbench   [kernel.kallsyms]  [k] .default_wake_function

Detailed Data
-------------
- 4.62%  postgres  postgres  [.] hash_search_with_hash_value
   - hash_search_with_hash_value
      - 2.19% BufTableLookup
         - 2.15% BufTableLookup
              ReadBuffer_common
            - ReadBufferExtended
               - 1.32% _bt_relandgetbuf
      - 0.73% BufTableDelete
         - 0.71% BufTableDelete
              ReadBuffer_common
              ReadBufferExtended
      - 0.69% BufTableInsert
         - 0.68% BufTableInsert
              ReadBuffer_common
              ReadBufferExtended
        0.66% hash_search_with_hash_value
- 4.35%  pgbench  [kernel.kallsyms]  [k] .find_busiest_group
   - .find_busiest_group
      - 4.28% .find_busiest_group
         - 4.26% .load_balance
            - 4.26% .idle_balance
                 .__schedule
               - 4.26% .schedule_hrtimeout_range_clock
                    .do_select
                    .core_sys_select
- 3.71%  postgres  postgres  [.] s_lock
   - s_lock
      - 3.19% hash_search_with_hash_value
         - 3.18% hash_search_with_hash_value
            - 1.60% BufTableInsert
                 ReadBuffer_common
               - ReadBufferExtended
            - 1.57% BufTableDelete
                 ReadBuffer_common
               - ReadBufferExtended
                  - 0.93% index_fetch_heap

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
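For context on where that s_lock time comes from: the spinlock in question
guards dynahash's shared free list of hash entries, so every
BufTableInsert() and BufTableDelete() from every backend funnels through it.
A simplified sketch of the hot path (paraphrased from get_hash_entry() in
dynahash.c; not the verbatim source):

/* Allocating an entry for a partitioned (shared) hash table: one spinlock
 * serializes free-list access across all backends and all partitions. */
SpinLockAcquire(&hctl->mutex);
newElement = hctl->freeList;
if (newElement != NULL)
    hctl->freeList = newElement->link;
SpinLockRelease(&hctl->mutex);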
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila amit.kapil...@gmail.com wrote:
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a clean patch
> > against master to test with.
>
> Please find the rebased patch attached with this mail. I have taken some
> performance data as well and done some analysis based on the same.
>
> Below data is the median of 3 runs.
>
> patch_ver/client_count   1      8       32      64      128     256
> HEAD                     18884  118628  251093  216294  186625  177505
> PATCH                    18743  122578  247243  205521  179712  175031
>
> Here we can see that the performance dips at higher client counts (>=32),
> which was quite surprising for me, as I was expecting it to improve,
> because bgreclaimer reduces contention by making buffers available on the
> free list. So I tried to analyze the situation using perf and found that
> in the above configuration there is contention around the freelist
> spinlock with HEAD, and the same is removed by the patch, but still the
> performance goes down with the patch. On further analysis, I observed that
> after the patch there is actually an increase in contention around
> ProcArrayLock (a shared LWLock) via GetSnapshotData, which sounds a bit
> odd, but that's what I can see in the profiles.
>
> Based on the analysis, a few ideas which I would like to investigate
> further are:
> a. As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared lwlocks.

I have tried applying Andres's wait-free LW_SHARED acquisition patch, posted
by him at the link below, along with the bgreclaimer patch:
http://www.postgresql.org/message-id/20141008133533.ga5...@alap3.anarazel.de

After that I observed that contention for LW_SHARED has reduced for this
load, but it didn't help much in terms of performance, so I again rechecked
the profile, and this time most of the contention has moved to the spinlock
used in dynahash for the buf mapping tables; please refer to the profile
(for 128 client count; read-only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches
-----------------------------------------------------------
  9.24%  swapper   [unknown]          [H] 0x011d9c10
+ 7.19%  postgres  postgres           [.] s_lock
+ 3.52%  postgres  postgres           [.] GetSnapshotData
+ 3.02%  postgres  postgres           [.] calc_bucket
+ 2.71%  postgres  postgres           [.] hash_search_with_hash_value
  2.32%  postgres  [unknown]          [H] 0x011e0d7c
+ 2.17%  postgres  postgres           [.] pg_atomic_fetch_add_u32_impl
+ 1.84%  postgres  postgres           [.] AllocSetAlloc
+ 1.57%  postgres  postgres           [.] _bt_compare
+ 1.05%  postgres  postgres           [.] AllocSetFreeIndex
+ 1.02%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user_power7
+ 0.94%  postgres  postgres           [.] tas
+ 0.85%  swapper   [kernel.kallsyms]  [k] .int_sqrt
+ 0.80%  postgres  postgres           [.] pg_encoding_mbcliplen
+ 0.78%  pgbench   [kernel.kallsyms]  [k] .find_busiest_group
  0.65%  pgbench   [unknown]          [H] 0x011d96e0
+ 0.59%  postgres  postgres           [.] hash_any
+ 0.54%  postgres  postgres           [.] LWLockRelease

Detailed Profile Data
---------------------
- 7.19%  postgres  postgres  [.] s_lock
   - s_lock
      - 3.79% s_lock
         - 1.69% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              0
      - 1.63% hash_search_with_hash_value
         - 1.63% BufTableDelete
              BufferAlloc
              ReadBuffer_common
              ReadBufferExtended
      - 1.45% hash_search_with_hash_value
         - 1.45% hash_search_with_hash_value
              BufTableDelete
              BufferAlloc
              ReadBuffer_common
      - 1.43% get_hash_entry
         - 1.43% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              ReadBuffer_common
- 3.52%  postgres  postgres  [.] GetSnapshotData
   - GetSnapshotData
      - 3.50% GetSnapshotData
         - 3.49% GetTransactionSnapshot
            - 1.75% exec_bind_message
                 PostgresMain
                 0
            - 1.74% PortalStart
                 exec_bind_message
                 PostgresMain
                 0

To reduce the above contention, I tried to write a patch to replace the
spinlock used in dynahash to manage the free list with atomic operations.
There is still work pending on this patch with respect to ensuring that the
approach used in the patch is completely sane; however, I am posting the
patch so that others can have a look at it and give me feedback about the
approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> After that I observed that contention for LW_SHARED has reduced for this
> load, but it didn't help much in terms of performance, so I again rechecked
> the profile, and this time most of the contention has moved to the spinlock
> used in dynahash for the buf mapping tables; please refer to the profile
> (for 128 client count; read-only load) below:
>
> bgreclaimer patch + wait free lw_shared acquisition patches
> -----------------------------------------------------------
>   9.24%  swapper   [unknown]          [H] 0x011d9c10
> + 7.19%  postgres  postgres           [.] s_lock
> + 3.52%  postgres  postgres           [.] GetSnapshotData
> [...]

This profile is without -O2 again. I really don't think it makes much sense
to draw much inference from an unoptimized build. I realize that you said
that the builds you use for benchmarking don't have that problem, but that
doesn't make this profile meaningful...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> > I don't think OLTP really is the best test case for this. Especially not
> > pgbench with relatively small rows *and* a uniform distribution of
> > access. Try parallel COPY TO. Batch write loads is where I've seen this
> > hurt badly.
> just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
> hope that with an approach like yours this might become better.
>
> psql -f /tmp/prepare.sql
> pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Thanks for providing the scripts. You haven't specified how much data is
present in the 'large' file used in the tests. I have tried with different
sets of rows, but I could not see the dip that is present in your data when
you increased shared buffers from 1GB to 8GB; I also don't see any
difference with the patch. BTW, why do you think that for such workloads
this patch can be helpful? According to my understanding it can be helpful
mainly for read-mostly workloads where all the data doesn't fit in shared
buffers.

Performance Data
----------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

For 50 rows, data populated using the statements below:
insert into largedata_64 select 'aa', generate_series(1,50);
copy largedata_64 to '/tmp/large' binary;

pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

shared_buffers - 1GB
--------------------
progress: 7.0 s, 2.7 tps, lat 2326.645 ms stddev 173.506
progress: 11.5 s, 3.5 tps, lat 2295.577 ms stddev 78.949
progress: 15.8 s, 3.7 tps, lat 2298.217 ms stddev 223.346
progress: 20.4 s, 3.5 tps, lat 2350.187 ms stddev 192.312
progress: 25.1 s, 3.4 tps, lat 2280.206 ms stddev 54.580
progress: 31.9 s, 3.4 tps, lat 2408.593 ms stddev 243.230
progress: 45.2 s, 1.1 tps, lat 5120.151 ms stddev 3913.561
progress: 50.5 s, 1.3 tps, lat 8967.954 ms stddev 3384.229
progress: 52.7 s, 2.7 tps, lat 3883.788 ms stddev 1733.293
progress: 55.6 s, 3.2 tps, lat 2684.282 ms stddev 348.615
progress: 58.2 s, 3.4 tps, lat 2602.355 ms stddev 268.718
progress: 60.8 s, 3.1 tps, lat 2361.937 ms stddev 302.643
progress: 65.3 s, 3.5 tps, lat 2341.903 ms stddev 162.338
progress: 74.1 s, 2.6 tps, lat 2720.182 ms stddev 716.425
progress: 76.4 s, 3.5 tps, lat 3023.234 ms stddev 670.473
progress: 80.4 s, 2.0 tps, lat 2795.323 ms stddev 820.429
progress: 85.6 s, 1.9 tps, lat 4756.217 ms stddev 844.284
progress: 91.9 s, 2.2 tps, lat 3996.001 ms stddev 1301.143
progress: 96.6 s, 3.5 tps, lat 2284.419 ms stddev 85.013
progress: 101.1 s, 3.5 tps, lat 2282.848 ms stddev 71.388
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 275
latency average: 2939.784 ms
latency stddev: 1739.974 ms
tps = 2.710138 (including connections establishing)
tps = 2.710208 (excluding connections establishing)

shared_buffers - 8GB
--------------------
progress: 6.7 s, 2.7 tps, lat 2349.816 ms stddev 212.889
progress: 11.0 s, 3.5 tps, lat 2257.364 ms stddev 141.148
progress: 15.2 s, 3.8 tps, lat 2209.669 ms stddev 127.101
progress: 21.7 s, 3.7 tps, lat 2159.838 ms stddev 92.205
progress: 25.8 s, 3.9 tps, lat 2221.072 ms stddev 283.362
progress: 30.1 s, 3.5 tps, lat 2179.611 ms stddev 152.741
progress: 39.3 s, 2.1 tps, lat 2768.609 ms stddev 1265.508
progress: 50.9 s, 1.1 tps, lat 9361.388 ms stddev 2657.885
progress: 52.9 s, 1.0 tps, lat 2036.098 ms stddev 3.599
progress: 55.2 s, 4.3 tps, lat 2167.688 ms stddev 91.183
progress: 57.6 s, 3.0 tps, lat 2399.219 ms stddev 173.535
progress: 60.2 s, 4.1 tps, lat 2427.273 ms stddev 198.698
progress: 65.2 s, 3.4 tps, lat 2441.630 ms stddev 123.773
progress: 72.4 s, 2.9 tps, lat 2534.631 ms stddev 254.162
progress: 75.0 s, 3.9 tps, lat 2468.266 ms stddev 221.969
progress: 82.3 s, 3.0 tps, lat 2548.690 ms stddev 404.852
progress: 86.7 s, 1.4 tps, lat 3980.576 ms stddev 1205.743
progress: 92.5 s, 1.4 tps, lat 5174.340 ms stddev 643.415
progress: 97.1 s, 3.7 tps, lat 3252.847 ms stddev 1689.268
progress: 101.8 s, 3.4 tps, lat 2346.690 ms stddev 138.251
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 284
latency average: 2856.195 ms
latency stddev: 1740.699 ms
tps = 2.781603 (including connections establishing)
tps = 2.781682 (excluding connections establishing)

For 5000 rows
shared_buffers - 1GB
--------------------
progress: 5.0 s, 357.7 tps, lat 22.295 ms stddev 3.511
progress: 10.0 s, 339.0 tps, lat 23.606 ms stddev 4.388
progress: 15.0 s, 323.4 tps, lat 24.733 ms stddev 5.001
progress: 20.0 s, 329.6 tps, lat 24.258 ms stddev 4.407
progress: 25.0 s, 334.3 tps, lat 23.963 ms stddev 4.126
progress: 30.0 s, 337.5 tps, lat 23.699 ms
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-10 12:28:13 +0530, Amit Kapila wrote:
> On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund and...@2ndquadrant.com wrote:
> > On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> > > I don't think OLTP really is the best test case for this. Especially
> > > not pgbench with relatively small rows *and* a uniform distribution of
> > > access. Try parallel COPY TO. Batch write loads is where I've seen
> > > this hurt badly.
> > just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
> > hope that with an approach like yours this might become better.
> >
> > psql -f /tmp/prepare.sql
> > pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100
> Thanks for providing the scripts. You haven't specified how much data is
> present in the 'large' file used in the tests.

I don't think it matters much. It should be small enough that you get a
couple TPS over all backends.

> I have tried with different sets of rows, but I could not see the dip that
> is present in your data when you increased shared buffers from 1GB to 8GB;
> I also don't see any difference with the patch.

Interesting. I wonder whether that's because the concurrency wasn't high
enough for that machine to show the problem. I ran the test on my
workstation, which has 8 actual cores.

> BTW, why do you think that for such workloads this patch can be helpful?
> According to my understanding it can be helpful mainly for read-mostly
> workloads where all the data doesn't fit in shared buffers.

The performance dip comes from all the backends performing the clock sweep.
As the access is pretty uniform, all the buffers start out with some usage
count (IIRC 3 in this example; much worse if 5). Due to the uniform usage
count, the clock sweep frequently has to go several times through *all* the
buffers. That leads to quite horrible performance in some cases. I had
hoped that bgreclaimer could make that workload less horrible by funneling
most of the accesses through the freelist.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
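To illustrate the behaviour Andres describes, here is a hedged sketch of the
clock sweep (heavily simplified from StrategyGetBuffer(): buffer header
locking, pinning and the freelist are omitted, and the descriptor is pared
down to the two relevant fields). With a uniform usage_count of 3 across all
NBuffers, each eviction may travel up to three full revolutions of the ring
before any count reaches zero:

typedef struct BufferDescSketch
{
    int refcount;       /* pins currently held */
    int usage_count;    /* recency counter, saturates at a small maximum */
} BufferDescSketch;

static BufferDescSketch *buffers;  /* NBuffers entries, in shared memory */
static int  NBuffers;
static int  nextVictimBuffer;

static BufferDescSketch *
clock_sweep(void)
{
    for (;;)
    {
        BufferDescSketch *buf = &buffers[nextVictimBuffer];

        nextVictimBuffer = (nextVictimBuffer + 1) % NBuffers;

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;     /* still warm: give it another lap */
            else
                return buf;             /* cold: evict this one */
        }
    }
}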
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a clean patch
> > against master to test with.
> Please find the rebased patch attached with this mail. I have taken some
> performance data as well and done some analysis based on the same.
>
> Performance Data
> ----------------
> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB
> max_connections = 300
> Database Locale = C
> checkpoint_segments = 256
> checkpoint_timeout = 15min
> shared_buffers = 8GB
> scale factor = 5000
> Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
> Duration of each individual run = 5 mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatively small rows *and* a uniform distribution of access.
Try parallel COPY TO. Batch write loads is where I've seen this hurt badly.

> patch_ver/client_count   1      8       32      64      128     256
> HEAD                     18884  118628  251093  216294  186625  177505
> PATCH                    18743  122578  247243  205521  179712  175031

So, pretty much no benefits on any scale, right?

> Here we can see that the performance dips at higher client counts (>=32),
> which was quite surprising for me, as I was expecting it to improve,
> because bgreclaimer reduces contention by making buffers available on the
> free list. So I tried to analyze the situation using perf and found that
> in the above configuration there is contention around the freelist
> spinlock with HEAD, and the same is removed by the patch, but still the
> performance goes down with the patch. On further analysis, I observed that
> after the patch there is actually an increase in contention around
> ProcArrayLock (a shared LWLock) via GetSnapshotData, which sounds a bit
> odd, but that's what I can see in the profiles.
>
> Based on the analysis, a few ideas which I would like to investigate
> further are:
> a. As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared lwlocks.
> b. Reduce some of the instructions added by the patch in
> StrategyGetBuffer(), e.g. instead of awakening bgreclaimer at a low
> threshold, awaken it when StrategyGetBuffer() has to do a clock sweep.

Are you sure you didn't mix up the profiles here? The head vs. patched
profiles look more like profiles from different client counts than from
different versions of the code.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Oct 9, 2014 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote:
> On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> > [performance data quoted above]
> > patch_ver/client_count   1      8       32      64      128     256
> > HEAD                     18884  118628  251093  216294  186625  177505
> > PATCH                    18743  122578  247243  205521  179712  175031
> So, pretty much no benefits on any scale, right?

Almost right; there seems to be a slight benefit at client count 8, however
that can be due to variation as well.

> > [analysis and ideas quoted above]
> Are you sure you didn't mix up the profiles here?

I have tried this 2 times; basically I am quite confident from my side, but
human error can't be ruled out. I have used the steps below.

Steps used for profiling
------------------------
During configure, use CFLAGS=-fno-omit-frame-pointer

Terminal-1:
start the server

Terminal-2:
./pgbench -c 64 -j 64 -T 300 -S -M prepared postgres

Terminal-3:
perf record -a -g sleep 60    (this command is run after a minute or so of
test start)

After the test is finished:
perf report -g graph,0.5,callee

Do you see any problem in the way I am collecting the perf reports? In any
case, I can try once more if you still doubt the profiles.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
> On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
> > [performance data quoted above]
> I don't think OLTP really is the best test case for this. Especially not
> pgbench with relatively small rows *and* a uniform distribution of access.
> Try parallel COPY TO. Batch write loads is where I've seen this hurt
> badly.

As an example. The attached scripts go from:

progress: 5.3 s, 20.9 tps, lat 368.917 ms stddev 49.655
progress: 10.1 s, 21.0 tps, lat 380.326 ms stddev 64.525
progress: 15.1 s, 14.1 tps, lat 568.108 ms stddev 226.040
progress: 20.4 s, 12.0 tps, lat 634.557 ms stddev 300.519
progress: 25.2 s, 17.5 tps, lat 461.738 ms stddev 136.257
progress: 30.2 s, 9.8 tps, lat 850.766 ms stddev 305.454
progress: 35.3 s, 12.2 tps, lat 670.473 ms stddev 271.075
progress: 40.2 s, 7.9 tps, lat 972.617 ms stddev 313.152
progress: 45.3 s, 14.9 tps, lat 546.056 ms stddev 211.987
progress: 50.2 s, 13.2 tps, lat 610.608 ms stddev 271.780
progress: 55.5 s, 16.9 tps, lat 468.757 ms stddev 156.516
progress: 60.5 s, 14.3 tps, lat 548.913 ms stddev 190.414
progress: 65.7 s, 9.3 tps, lat 821.293 ms stddev 353.665
progress: 70.1 s, 16.0 tps, lat 524.240 ms stddev 174.903
progress: 75.2 s, 17.0 tps, lat 485.692 ms stddev 194.273
progress: 80.2 s, 19.9 tps, lat 396.295 ms stddev 78.891
progress: 85.3 s, 18.3 tps, lat 423.744 ms stddev 105.798
progress: 90.1 s, 14.5 tps, lat 577.373 ms stddev 270.914
progress: 95.3 s, 12.0 tps, lat 649.434 ms stddev 247.001
progress: 100.3 s, 14.6 tps, lat 563.693 ms stddev 275.236
tps = 14.81 (including connections establishing)

to:

progress: 5.1 s, 18.9 tps, lat 409.766 ms stddev 75.032
progress: 10.3 s, 20.2 tps, lat 396.781 ms stddev 67.593
progress: 15.1 s, 19.1 tps, lat 418.545 ms stddev 109.431
progress: 20.3 s, 20.6 tps, lat 388.606 ms stddev 74.259
progress: 25.1 s, 19.5 tps, lat 406.591 ms stddev 109.050
progress: 30.0 s, 19.1 tps, lat 420.199 ms stddev 157.005
progress: 35.0 s, 18.4 tps, lat 421.102 ms stddev 124.019
progress: 40.3 s, 12.3 tps, lat 640.640 ms stddev 88.409
progress: 45.2 s, 12.8 tps, lat 586.471 ms stddev 145.543
progress: 50.5 s, 6.9 tps, lat 1116.603 ms stddev 285.479
progress: 56.2 s, 6.3 tps, lat 1349.055 ms stddev 381.095
progress: 60.6 s, 7.9 tps, lat 1083.745 ms stddev 452.386
progress: 65.0 s, 9.6 tps, lat 805.981 ms stddev 273.845
progress: 71.1 s, 9.6 tps, lat 798.273 ms stddev 184.108
progress: 75.2 s, 9.3 tps, lat 950.131 ms stddev 150.870
progress: 80.8 s, 8.6 tps, lat 899.389 ms stddev 135.090
progress: 85.3 s, 8.8 tps, lat 928.183 ms stddev 152.056
progress: 90.9 s, 8.0 tps, lat 929.737 ms stddev 71.155
progress: 95.7 s, 9.0 tps, lat 968.070 ms stddev 127.824
progress: 100.3 s, 8.7 tps, lat 911.767 ms stddev 130.697

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I hope
that with an approach like yours this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

prepare.sql (attachment):

CREATE OR REPLACE FUNCTION exec(text) RETURNS text
LANGUAGE plpgsql VOLATILE AS $f$
BEGIN
    EXECUTE $1;
    RETURN $1;
END;
$f$;
\o /dev/null
SELECT exec('drop table if exists largedata_'||g.i||';
    create unlogged table largedata_'||g.i||'(data bytea, id serial primary key);')
FROM generate_series(0, 64) g(i);
\o

copy.sql (attachment):

COPY largedata_:client_id(data) FROM '/tmp/large' BINARY;
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 10:42:29 -0400, Robert Haas wrote:
> On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com wrote:
> > On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
> > > On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote:
> > > > That leads me to wonder: Have you measured different, lower, numbers
> > > > of buffer mapping locks? 128 locks is, if we align them properly (as
> > > > we should), 8KB of memory. Common L1 cache sizes are around 32k...
> > > Amit has some results upthread showing 64 being good, but not as good
> > > as 128. I haven't verified that myself, but have no reason to doubt it.
> > How about you push the spinlock change and I crosscheck the partition
> > number on a multi-socket x86 machine? Seems worthwhile to make sure that
> > it doesn't cause problems on x86. I seriously doubt it'll, but ...
> OK.

Given that the results look good, do you plan to push this?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
> OK. Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping partitions
to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS? If so, and if you
don't have any reservations, yeah, I'll go change it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
> > OK. Given that the results look good, do you plan to push this?
> By "this", you mean the increase in the number of buffer mapping
> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it, I wonder if we shouldn't define
MAX_SIMUL_LWLOCKS like

#define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)

or something like that?

> If so, and if you don't have any reservations, yeah, I'll go change it.

Yes, I'm happy with going forward.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
> Yes. Now that I think about it, I wonder if we shouldn't define
> MAX_SIMUL_LWLOCKS like
>
> #define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)
>
> or something like that?

Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest thing,
and I don't see any reason to assume that, even if we're making it true for
now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-02 10:56:05 -0400, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
> > Yes. Now that I think about it, I wonder if we shouldn't define
> > MAX_SIMUL_LWLOCKS like
> > #define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)
> > or something like that?
> Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest thing,
> and I don't see any reason to assume that, even if we're making it true
> for now.

The reason I'm suggesting it is that NUM_BUFFER_PARTITIONS (and
NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to be
held at the same time. It doesn't seem friendly to require users
experimenting with changing this to know about a define that's private to
lwlock.c. But I'm fine with doing this not at all or separately; there's no
need to actually do it together.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
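For readers following the MAX_SIMUL_LWLOCKS discussion, a hedged sketch of
the bookkeeping the define bounds (simplified from lwlock.c; the real code
stores a little more per slot):

typedef struct LWLock LWLock;   /* stand-in for the real struct */

#define NUM_BUFFER_PARTITIONS  128
/* Andres's proposal above; the value actually committed may differ. */
#define MAX_SIMUL_LWLOCKS      (NUM_BUFFER_PARTITIONS + 64)

/* Each backend tracks the lwlocks it holds in a fixed-size array so they
 * can be released on error; holding all buffer mapping partition locks at
 * once therefore requires MAX_SIMUL_LWLOCKS >= NUM_BUFFER_PARTITIONS, plus
 * headroom for whatever else the backend holds. */
static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
static int     num_held_lwlocks = 0;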
Re: [HACKERS] Scaling shared buffer eviction
On 10/02/2014 05:40 PM, Robert Haas wrote:
> On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
> > OK. Given that the results look good, do you plan to push this?
> By "this", you mean the increase in the number of buffer mapping
> partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Hmm, do we actually ever need to hold all the buffer partition locks at the
same time? In a quick search for NUM_BUFFER_PARTITIONS in the code, I
couldn't find any place where we'd do that. I bumped up
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
"make check". It passed.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote:
> Hmm, do we actually ever need to hold all the buffer partition locks at
> the same time? In a quick search for NUM_BUFFER_PARTITIONS in the code, I
> couldn't find any place where we'd do that. I bumped up
> NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
> "make check". It passed.

Do a "make check-world" and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

I'd actually quite like to have a pg_buffercache version that, at least
optionally, doesn't do this, but that's a separate thing.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
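The pattern Andres is pointing at, as a hedged sketch (paraphrased from
pg_buffercache_pages.c; the exact lock-accessor spelling varies across
versions, so BufMappingPartitionLock(i) below is a stand-in):

/* Take every buffer mapping partition lock, in the same order everywhere
 * to avoid deadlock, so the scan sees a consistent buffer table.  This is
 * the one caller holding NUM_BUFFER_PARTITIONS lwlocks simultaneously,
 * which is what trips over MAX_SIMUL_LWLOCKS in check-world. */
for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
    LWLockAcquire(BufMappingPartitionLock(i), LW_SHARED);

/* ... copy each buffer header's tag and state into the result set ... */

for (i = NUM_BUFFER_PARTITIONS - 1; i >= 0; i--)
    LWLockRelease(BufMappingPartitionLock(i));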
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund and...@2ndquadrant.com wrote:
> Do a "make check-world" and it'll hopefully fail ;). Check
> pg_buffercache_pages.c.

Yep. Committed, with an update to the comments in lwlock.c to allude to the
pg_buffercache issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On 2014-10-01 20:54:39 +0200, Andres Freund wrote:
> Here we go. Postgres was configured with:
>
> -c shared_buffers=8GB \
> -c log_line_prefix=[%m %p] \
> -c log_min_messages=debug1 \
> -p 5440 \
> -c checkpoint_segments=600 -c max_connections=200

Robert reminded me that I forgot to report the hardware side of this...

4x E5-4620 @ 2.20GHz: 32 cores, 64 threads
256GB RAM

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Scaling shared buffer eviction
Part of this patch was already committed, and the overall patch has had its fair share of review for this commitfest, so I'm marking this as Returned with feedback. The benchmark results for the bgreclaimer showed a fairly small improvement, so it doesn't seem like anyone's going to commit the rest of the patch soon, not without more discussion and testing anyway. Let's not hold up the commitfest for that.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote:
> But this gets at another point: the way we're benchmarking this right now,
> we're really conflating the effects of three different things:
> 1. Changing the locking regimen around the freelist and clocksweep.
> 2. Adding a bgreclaimer process.
> 3. Raising the number of buffer locking partitions.

First of all, thanks for committing part-1 of these changes. It seems you
are planning to commit part-3 based on the results of tests which Andres is
planning to do, and for the remaining part (part-2), today I have tried some
tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                  16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions   157732  229547  271536  245295
scalable_buffer_eviction_v9.patch                       163762  230753  275147  248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                  16      32      64      128
reduce-replacement-locking.patch + 128 Buf Partitions   157781  212134  202209  171176
scalable_buffer_eviction_v9.patch                       160301  213922  208680  172720

The results indicate that in all cases there is a benefit from doing part-2
(bgreclaimer). Though the benefit at this configuration is not high, at
some higher configurations (scale factor - 1) there might be more benefit.
Do you see any merit in pursuing further to accomplish part-2 as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 09/25/2014 05:40 PM, Andres Freund wrote:
> There's two reasons for that: a) dynahash just isn't very good and it does
> a lot of things that will never be necessary for these hashes. b) the key
> into the hash table is *far* too wide. A significant portion of the time
> is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash? We could precalculate
the hash for RelFileNode+ForkNumber and store it in RelationData. At a
lookup, you'd only need to mix in the block number.

- Heikki
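A hedged sketch of Heikki's idea (the names are illustrative, not an
existing API): hash the constant RelFileNode+ForkNumber prefix once when the
relation is opened, cache it (e.g. in RelationData), and pay only a cheap
final mix per lookup:

#include <stdint.h>

/* Computed once per relation fork and cached. */
typedef struct CachedTagPrefix
{
    uint32_t prefix_hash;   /* hash of (spcNode, dbNode, relNode, forkNum) */
} CachedTagPrefix;

/* Per-lookup work is then just mixing in the block number; a murmur-style
 * finalizer is shown, purely as an illustration. */
static inline uint32_t
buftag_hash(CachedTagPrefix prefix, uint32_t blockNum)
{
    uint32_t h = prefix.prefix_hash ^ blockNum;

    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}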
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
> On 09/25/2014 05:40 PM, Andres Freund wrote:
> > There's two reasons for that: a) dynahash just isn't very good and it
> > does a lot of things that will never be necessary for these hashes.
> > b) the key into the hash table is *far* too wide. A significant portion
> > of the time is spent comparing buffer/lock tags.
> Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Besides making the hashtable more efficient, a smaller key (say, 4 byte
relfilenode, 4 byte blocknumber) would also make using a radix tree or
similar more realistic. I've prototyped that once and it has nice
properties, but the tree is too deep. Obviously it'd also help to make
buffer descriptors smaller, which is also good from a cache efficiency
perspective...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
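To make the "smaller key" point concrete, a purely speculative sketch (this
layout does not exist in PostgreSQL): with a 4-byte relfilenode and a 4-byte
block number, tag equality collapses to a single 64-bit compare instead of a
20-byte memcmp:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct NarrowBufferTag
{
    uint32_t relfilenode;
    uint32_t blockNum;
} NarrowBufferTag;

static inline bool
narrow_tag_equal(NarrowBufferTag a, NarrowBufferTag b)
{
    uint64_t wa, wb;

    /* the memcpys compile down to plain 8-byte loads */
    memcpy(&wa, &a, sizeof(wa));
    memcpy(&wb, &b, sizeof(wb));
    return wa == wb;
}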
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> First of all, thanks for committing part-1 of these changes. It seems you
> are planning to commit part-3 based on the results of tests which Andres
> is planning to do, and for the remaining part (part-2), today I have tried
> some tests, the results of which are as follows:
>
> Scale Factor - 3000, Shared_buffer - 8GB
>
> Patch_Ver/Client_Count                                  16      32      64      128
> reduce-replacement-locking.patch + 128 Buf Partitions   157732  229547  271536  245295
> scalable_buffer_eviction_v9.patch                       163762  230753  275147  248309
>
> Scale Factor - 3000, Shared_buffer - 8GB
>
> Patch_Ver/Client_Count                                  16      32      64      128
> reduce-replacement-locking.patch + 128 Buf Partitions   157781  212134  202209  171176
> scalable_buffer_eviction_v9.patch                       160301  213922  208680  172720
>
> The results indicate that in all cases there is a benefit from doing
> part-2 (bgreclaimer). Though the benefit at this configuration is not
> high, at some higher configurations (scale factor - 1) there might be more
> benefit. Do you see any merit in pursuing further to accomplish part-2 as
> well?

Interesting results. Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal. The biggest jump you've got seems to be at
scale factor 3000 with 64 clients, where you picked up about 4%. 4% isn't
nothing, but it's not a lot, either. On the other hand, this might not be
the best we can do. There may be further improvements to bgreclaimer that
make the benefit larger.

Backing up a bit, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see? You didn't say which machine this testing was done on, or
how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 clients, because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good. Still, we'd expect performance to go
up, or at least not go down. Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down, and between 64 and 128 we go down more. You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores. That implies
a remaining contention bottleneck.

My results from yesterday were a bit different. I tested 1 client, 8
clients, and multiples of 16 clients out to 96. With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients. It definitely
wasn't linear, but it went up, not down. I don't know why this is different
from what you are seeing.

Anyway, there's a little more ambiguity there about how much contention
remains, but my bet is that there is at least some contention that we could
still hope to remove. We need to understand where that contention is. Are
the buffer mapping locks still contended? Is the new spinlock contended?
Are there other contention points? I won't be surprised if it turns out
that the contention is on the new spinlock and that a proper design for
bgreclaimer is the best way to remove that contention, but I also won't be
surprised if it turns out that there are bigger wins elsewhere. So I think
you should try to figure out where the remaining contention is first, and
then we can discuss what to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer
patch over what I committed, so that we have a clean patch against master
to test with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
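For anyone reproducing this, one standard way to answer Robert's
where-is-the-remaining-contention question is PostgreSQL's built-in lwlock
instrumentation, enabled with a build-time define (shown hedged here: the
exact output format differs across versions, and it does not cover
spinlocks):

/* Add to pg_config_manual.h (or pass via CPPFLAGS) and rebuild: */
#define LWLOCK_STATS

/* Each backend then prints per-lwlock counters at exit, along the lines of
 *   PID <pid> lwlock <id>: shacq <n> exacq <n> blk <n>
 * i.e. shared/exclusive acquisitions and how often the backend had to
 * block, which shows directly whether the buffer mapping locks are still
 * among the hot spots. */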
Re: [HACKERS] Scaling shared buffer eviction
On 09/26/2014 03:26 PM, Andres Freund wrote:
> On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
> > Hmm. Is it the comparing, or calculating the hash?
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

Hmm. We could provide a custom compare function instead of relying on
memcmp. We can do somewhat better than a generic memcmp when we know that
the BufferTag is MAXALIGNed (is it? at least it's 4-byte aligned), and that
it's always exactly 20 bytes.

I wonder if you're actually just seeing a cache miss showing up in the
profile, though.

- Heikki
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the rest
only if that matches?

One idea for improving this is to get rid of relation forks. Like all true
PostgreSQL patriots, I love the free space map and the visibility map
passionately, and I believe that those features are among the biggest
contributors to the success of the project over the last half-decade. But
I'd love them even more if they didn't triple our rate of inode consumption
and bloat our buffer tags. Moreover, it's just not an extensible mechanism:
too many things have to loop over all forks, and it just doesn't scale to
keep adding more of them. If we added a metapage to each heap, we could
give the FSM and VM their own relfilenode and just have the heap point at
them. Or (maybe better still) we could store the data in the heap itself.
It would be a lot of work, though. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
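A hedged sketch of the comparison order Robert suggests, with the most
discriminating fields tested first so a typical mismatch exits after one
compare (the field names follow buf_internals.h, but this helper itself is
illustrative, not committed code):

#include <stdbool.h>
#include <stdint.h>

/* Pared-down stand-ins for PostgreSQL's RelFileNode and BufferTag. */
typedef struct RelFileNode
{
    uint32_t spcNode;       /* tablespace */
    uint32_t dbNode;        /* database */
    uint32_t relNode;       /* relation */
} RelFileNode;

typedef struct BufferTag
{
    RelFileNode rnode;
    int32_t     forkNum;
    uint32_t    blockNum;
} BufferTag;

static inline bool
buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    /* relfilenode and block number differ in nearly every mismatch;
     * tablespace and database rarely discriminate, so test them last. */
    return a->rnode.relNode == b->rnode.relNode &&
           a->blockNum == b->blockNum &&
           a->forkNum == b->forkNum &&
           a->rnode.dbNode == b->rnode.dbNode &&
           a->rnode.spcNode == b->rnode.spcNode;
}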
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple win
would be to have a buffer-tag-specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first, and possibly get better branch
prediction. Do you have a workload where I could test whether this helps
alleviate the comparison overhead?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum
+			&& ((uint64 *) a)[0] == ((uint64 *) b)[0]
+			&& ((uint64 *) a)[1] == ((uint64 *) b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  *		size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
> On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> > First of all, thanks for committing part-1 of these changes. [...] today
> > I have tried some tests, the results of which are as follows:
> >
> > Scale Factor - 3000, Shared_buffer - 8GB

Typo here. It should be: Scale Factor - 3000, Shared_buffer - *2* GB.

> > Patch_Ver/Client_Count                                  16      32      64      128
> > reduce-replacement-locking.patch + 128 Buf Partitions   157781  212134  202209  171176
> > scalable_buffer_eviction_v9.patch                       160301  213922  208680  172720
> >
> > The results indicate that in all cases there is a benefit from doing
> > part-2 (bgreclaimer). [...] Do you see any merit in pursuing further to
> > accomplish part-2 as well?
> Interesting results. Thanks for gathering this data.

One more point I missed: the above data was collected using the -M prepared
option of pgbench.

> If this is the best we can do with the bgreclaimer, I think the case for
> pursuing it is somewhat marginal. The biggest jump you've got seems to be
> at scale factor 3000 with 64 clients, where you picked up about 4%. 4%
> isn't nothing, but it's not a lot, either. On the other hand, this might
> not be the best we can do. There may be further improvements to
> bgreclaimer that make the benefit larger.
>
> Backing up a bit, to what extent have we actually solved the problem here?
> If we had perfectly removed all of the scalability bottlenecks, what would
> we expect to see? You didn't say which machine this testing was done on

It was IBM POWER7. Sorry, I should have mentioned it.

> , or how many cores it had, but for example on the IBM POWER7 machine, we
> probably wouldn't expect the throughput at 64 clients to be 4 times the
> throughput at 16 clients [...] Still, we'd expect performance to go up, or
> at least not go down. Your data shows a characteristic performance knee:
> between 16 and 32 clients we go up, but then between 32 and 64 we go down,

Here another point worth noting is that it goes down between 32 and 64 when
shared_buffers is 2GB; however, when shared_buffers is 8GB it doesn't go
down between 32 and 64.

> and between 64 and 128 we go down more. You haven't got enough data points
> there to show very precisely where the knee is, but unless you tested this
> on a smaller box than what you have been using, we're certainly hitting
> the knee sometime before we run out of physical cores. That implies a
> remaining contention bottleneck.
>
> My results from yesterday were a bit different. I tested 1 client, 8
> clients, and multiples of 16 clients out to 96. With
> reduce-replacement-locking.patch + 128 buffer mapping partitions,
> performance continued to rise all the way out to 96 clients. It definitely
> wasn't linear, but it went up, not down. I don't know why this is
> different from what you are seeing.

I think it is almost the same if we consider the same configuration
(scale_factor - 3000, shared_buffers - 8GB).

> Anyway, there's a little more ambiguity there about how much contention
> remains, but my bet is that there is at least some contention that we
> could still hope to remove. We need to understand where that contention
> is. [...] So I think you should try to figure out where the remaining
> contention is first, and then we can discuss what to do about it.

Makes sense.

> On another point, I think it would be a good idea to rebase the
> bgreclaimer patch over what I committed, so that we have a clean patch
> against master to test with.

I think this also makes sense; however, I think it is better to first see
where the bottleneck is.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-26 17:01:52 +0300, Ants Aasma wrote: On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote: Neither, really. The hash calculation is visible in the profile, but not that pronounced yet. The primary thing noticeable in profiles (besides cache efficiency) is the comparison of the full tag after locating a possible match in a bucket. 20 byte memcmp's aren't free. I'm not arguing against a more efficient hash table, but one simple win would be to have a buffer-tag-specific comparison function. I mean something like the attached patch. This way we avoid the loop counter overhead, can check the blocknum first, and possibly have better branch prediction. Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing most benefits - at least with my own hashtable implementation. Do you have a workload where I could test if this helps alleviate the comparison overhead? Fully cached readonly pgbench workloads with -M prepared already show it. But it gets more pronounced for workloads that access buffers at a higher frequency. At two customers I've seen this really badly in the profile because they have OLTP statements that do some index nested loops, often looking up the same buffer *over and over*. There often isn't a better plan (with current pg join support, at least) for joining a somewhat small number of rows out of a large table to a small/mid-sized table. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
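To make the suggestion concrete, a buffer-tag-specific comparison might look like the sketch below. The struct layout is a simplified stand-in for the real BufferTag in storage/buf_internals.h, and the function name is made up; the point is checking blockNum first so the common mismatch case fails fast instead of running a 20-byte memcmp loop:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-in for PostgreSQL's 20-byte BufferTag. */
    typedef struct BufTagSketch
    {
        uint32_t spcNode;   /* tablespace */
        uint32_t dbNode;    /* database */
        uint32_t relNode;   /* relation */
        int32_t  forkNum;   /* fork number */
        uint32_t blockNum;  /* block number */
    } BufTagSketch;

    static inline bool
    buftag_equal(const BufTagSketch *a, const BufTagSketch *b)
    {
        /* blockNum is the field most likely to differ between entries
         * that landed in the same hash bucket, so test it first; the
         * compiler emits straight-line compares with no loop counter. */
        return a->blockNum == b->blockNum &&
               a->relNode == b->relNode &&
               a->dbNode == b->dbNode &&
               a->spcNode == b->spcNode &&
               a->forkNum == b->forkNum;
    }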
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-26 09:59:41 -0400, Robert Haas wrote: On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote: Neither, really. The hash calculation is visible in the profile, but not that pronounced yet. The primary thing noticeable in profiles (besides cache efficiency) is the comparison of the full tag after locating a possible match in a bucket. 20 byte memcmp's aren't free. Would it be faster to test the relfilenode first, and then test the rest only if that matches? I tried that and I couldn't see that much benefit. Check my message to Heikki. One idea for improving this is to get rid of relation forks. I think that's the hardest end to start from. A cool goal, but hard. Getting rid of the tablespace sounds comparatively simple. And even getting rid of the database in the buffer tag seems to be simpler, although already pretty hard. Like all true PostgreSQL patriots, I love the free space map and the visibility map passionately, and I believe that those features are among the biggest contributors to the success of the project over the last half-decade. But I'd love them even more if they didn't triple our rate of inode consumption and bloat our buffer tags. What's more, it's just not an extensible mechanism: too many things have to loop over all forks, and it just doesn't scale to keep adding more of them. If we added a metapage to each heap, we could have the FSM and VM have their own relfilenode and just have the heap point at them. Or (maybe better still) we could store the data in the heap itself. It would be a lot of work, though. :-( Yea, it's really hard. And nearly impossible to do without breaking binary compatibility. What I've been wondering about was to give individual forks their own relfilenodes and manage them via columns in pg_class. But that's also a heck of a lot of work, and it gets complicated for unlogged relations because we need to access those during recovery, when we don't have catalog access yet and can't access all databases anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote: The patch I attached the first time was just the last commit in the git repository where I wrote the patch, rather than the changes that I made on top of that commit. So, yes, the results from the previous message are with the patch attached to the follow-up. I just typed the wrong git command when attempting to extract that patch to attach it to the email. Here are some more results. TL;DR: The patch still looks good, but we should raise the number of buffer mapping partitions as well. On the IBM POWER7 machine, I ran read-only pgbench tests with 1 client, 8 clients, and all multiples of 16 up to 96. I ran these tests at scale factor 1000 and scale factor 3000. I tested four builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204, that same commit with the number of buffer mapping partitions raised to 128, that commit with reduce-replacement-locking.patch applied, and that commit with reduce-replacement-locking.patch applied AND the number of buffer mapping partitions raised to 128. The results from each configuration are reported in that order in the columns below; each is the median of three results. shared_buffers=8GB for all tests.

scale factor 1000

clients   master          master-128      reduce-lock     reduce-lock-128
1         8119.907618     8230.853237     8153.515217     8145.045004
8         65457.006762    65826.439701    65851.010116    65703.168020
16        125263.858855   125723.441853   125020.598728   129506.037997
32        176696.288187   182376.232631   178278.917581   186440.340283
48        193251.602743   214243.417591   197958.562641   226782.327868
64        182264.276909   218655.105894   190364.759052   256863.652885
80        171719.210488   203104.673512   179861.241080   274065.020956
96        162525.883898   190960.622943   169759.271356   277820.128782

scale factor 3000

clients   master          master-128      reduce-lock     reduce-lock-128
1         7690.357053     7723.925932     7772.207513     7684.079850
8         60789.325087    61688.547446    61485.398967    62546.166411
16        112509.423777   115138.385501   115606.858594   120015.350112
32        147881.211900   161359.994902   153302.501020   173063.463752
48        129748.929652   153986.160920   136164.387103   204935.207578
64        114364.542340   132174.970721   116705.371890   224636.957891
80        101375.265389   117279.931095   102374.794412   232966.076908
96        93144.724830    106676.309224   92787.650325    233862.872939

(Column key: master; master with 128 buffer mapping partitions; master with reduce-replacement-locking.patch; master with reduce-replacement-locking.patch and 128 partitions.)

Analysis: 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation. 2. To see the effect of increasing the number of buffer mapping locks to 128, compare the first TPS number in each line to the second, or the third to the fourth. Without reduce-replacement-locking.patch, that's a win at every concurrency level at both scale factors. With that patch, the 1 and 8 client tests are small losses at scale factor 1000, and the 1 client test is a small loss at scale factor 3000. The single-client results, which are often a concern for scalability patches, bear a bit of further comment.
In this round of testing, either patch alone improved things slightly, and both patches together made them slightly worse. Even if that is reproducible, I don't think it should be cause for concern, because it tends to indicate (at least to me) that the shifting around is just the result of slightly different placement of code across cache lines, or other minor factors we can't really pin down. So I'm inclined to (a) push reduce-replacement-locking.patch and then also (b) bump up the number of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS accordingly so that pg_buffercache doesn't get unhappy). Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
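For concreteness, the change proposed in (b) is essentially a two-constant tweak; a sketch follows, where the file locations and the second value are assumptions rather than quotations from any commit - the only hard requirement stated above is that MAX_SIMUL_LWLOCKS stay comfortably above the partition count, since pg_buffercache takes every buffer mapping partition lock simultaneously:

    /* src/include/storage/lwlock.h (sketch) */
    #define NUM_BUFFER_PARTITIONS  128     /* raised from 16 */

    /* src/backend/storage/lmgr/lwlock.c (sketch) - must exceed
     * NUM_BUFFER_PARTITIONS so pg_buffercache can hold every buffer
     * mapping partition lock at once; e.g. 200 */
    #define MAX_SIMUL_LWLOCKS      200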
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation. Comments? Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared buffers... merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation. Comments? Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared buffers... Good question. My belief is that the number of buffer mapping locks required to avoid serious contention will be roughly proportional to the number of hardware threads. At the time the value 16 was chosen, there were probably not more than 8-core CPUs in common use; but now we've got a machine with 64 hardware threads and, what do you know, it wants 128 locks. I think the long-term solution here is that we need a lock-free hash table implementation for our buffer mapping tables, because I'm pretty sure that just cranking the number of locks up and up is going to start to have unpleasant side effects at some point. We may be able to buy a few more years by just cranking it up, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
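A purely illustrative sketch of that sizing rule - no such function exists in PostgreSQL, and the factor of two is an assumption - would round twice the hardware thread count up to a power of two:

    /* Hypothetical: partitions ~ 2x hardware threads, rounded up to a
     * power of two, with a floor at the historical value of 16.
     * For 64 hardware threads this yields 128 locks, matching the
     * observation above. */
    static int
    choose_buffer_partitions(int hardware_threads)
    {
        int n = 16;

        while (n < 2 * hardware_threads)
            n *= 2;
        return n;
    }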
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:51:17 -0400, Robert Haas wrote: On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote: The patch I attached the first time was just the last commit in the git repository where I wrote the patch, rather than the changes that I made on top of that commit. So, yes, the results from the previous message are with the patch attached to the follow-up. I just typed the wrong git command when attempting to extract that patch to attach it to the email. Here are some more results. TL;DR: The patch still looks good, but we should raise the number of buffer mapping partitions as well. So I'm inclined to (a) push reduce-replacement-locking.patch and then also (b) bump up the number of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS accordingly so that pg_buffercache doesn't get unhappy). I'm happy with that. I don't think it's likely that a moderate increase in the number of mapping lwlocks will be noticeably bad for any workload. One difference is that the total number of lwlock acquisitions will be a bit higher, because it becomes more likely for the old and new buffer to fall into different partitions. But that's not really significant. The other difference is the number of cachelines touched. Currently, in concurrent workloads, there are already lots of L1 cache misses around the buffer mapping locks because they're exclusively owned by a different core/socket. So, to be effectively worse, the increase would need to lead to lower overall cache hit rates by the locks not being in *any* cache, or by displacing other content. That leads me to wonder: Have you measured different, lower, numbers of buffer mapping locks? 128 locks is - if we align them properly, as we should - 8KB of memory. Common L1 cache sizes are around 32k... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
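The 8KB figure is simply 128 locks times a 64-byte cache line. PostgreSQL pads lwlocks in arrays for exactly this reason; a simplified sketch of the idea (the LWLock body here is a stand-in for the real struct in storage/lwlock.h):

    #define CACHE_LINE_SIZE 64   /* assumption: typical x86 line size */

    typedef struct LWLock { int stand_in_for_real_fields; } LWLock;

    /* Pad each lock to a full cache line so two partitions' locks never
     * share a line and ping-pong between sockets. 128 partitions *
     * 64 bytes = 8192 bytes = 8KB, a quarter of a typical 32KB L1 data
     * cache - Andres's arithmetic above. */
    typedef union LWLockPadded
    {
        LWLock lock;
        char   pad[CACHE_LINE_SIZE];
    } LWLockPadded;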
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote: On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote: 1. To see the effect of reduce-replacement-locking.patch, compare the first TPS number in each line to the third, or the second to the fourth. At scale factor 1000, the patch wins in all of the cases with 32 or more clients and exactly half of the cases with 1, 8, or 16 clients. The variations at low client counts are quite small, and the patch isn't expected to do much at low concurrency levels, so that's probably just random variation. At scale factor 3000, the situation is more complicated. With only 16 bufmappinglocks, the patch gets its biggest win at 48 clients, and by 96 clients it's actually losing to unpatched master. But with 128 bufmappinglocks, it wins - often massively - on everything but the single-client test, which is a small loss, hopefully within experimental variation. Comments? Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared buffers... Wrong way round. You need to prove the upside of increasing it further, not the contrary. The primary downside is cache hit ratio and displacing other cache entries... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote: That leads me to wonder: Have you measured different, lower, numbers of buffer mapping locks? 128 locks is - if we align them properly, as we should - 8KB of memory. Common L1 cache sizes are around 32k... Amit has some results upthread showing 64 being good, but not as good as 128. I haven't verified that myself, but have no reason to doubt it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 10:22:47 -0400, Robert Haas wrote: On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote: That leads me to wonder: Have you measured different, lower, numbers of buffer mapping locks? 128 locks is - if we align them properly, as we should - 8KB of memory. Common L1 cache sizes are around 32k... Amit has some results upthread showing 64 being good, but not as good as 128. I haven't verified that myself, but have no reason to doubt it. How about you push the spinlock change and I crosscheck the partition number on a multi-socket x86 machine? Seems worthwhile to make sure that it doesn't cause problems on x86. I seriously doubt it will, but ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 10:09:30 -0400, Robert Haas wrote: I think the long-term solution here is that we need a lock-free hash table implementation for our buffer mapping tables, because I'm pretty sure that just cranking the number of locks up and up is going to start to have unpleasant side effects at some point. We may be able to buy a few more years by just cranking it up, though. I think mid to long term we actually need something other than a hashtable - something capable of efficiently checking for the existence of 'neighboring' buffers, so we can intelligently prefetch far enough ahead that the read actually completes when we get there. Also, I'm pretty sure that we'll need a way to efficiently remove all buffers for a relfilenode from shared buffers - linearly scanning for that isn't a good solution. So I think we need a different data structure. I've played around a bit with just replacing buf_table.c with a custom handrolled hashtable, because I've seen more than one production workload where hash_search_with_hash_value() is the top #1 entry in profiles, both CPU- and cache-miss-wise, with most calls coming from the buffer mapping and then from the lock manager. There's two reasons for that: a) dynahash just isn't very good, and it does a lot of things that will never be necessary for these hashes. b) the key into the hash table is *far* too wide. A significant portion of the time is spent comparing buffer/lock tags. The aforementioned replacement hash table was a good bit faster for fully cached workloads - but at the time I wrote it I could still make it crash in very high cache-pressure workloads, so that should be taken with a fair bit of salt. I think we can comparatively easily get rid of the tablespace in buffer tags. Getting rid of the database would already be a fair bit harder. I haven't really managed to get an idea how to remove the fork number without making the catalog much more complicated. I don't think we can go too long without at least some of these steps :(. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
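To put numbers on "the key is *far* too wide": the buffer tag is 20 bytes (tablespace, database, relation, fork, block - 4 bytes each). Dropping the tablespace, as suggested above, would leave 16 bytes, which can be compared with two 8-byte loads instead of a 20-byte memcmp - the "64bit compares" mentioned earlier in the thread. A hypothetical sketch (this struct does not exist in PostgreSQL):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical 16-byte tag with the tablespace removed. */
    typedef struct NarrowBufTag
    {
        uint32_t dbNode;
        uint32_t relNode;
        int32_t  forkNum;
        uint32_t blockNum;
    } NarrowBufTag;

    static inline bool
    narrow_buftag_equal(const NarrowBufTag *a, const NarrowBufTag *b)
    {
        uint64_t a0, a1, b0, b1;

        /* memcpy loads sidestep alignment and strict-aliasing issues,
         * and compile down to two plain 64-bit compares on mainstream
         * compilers. */
        memcpy(&a0, a, 8);
        memcpy(&a1, (const char *) a + 8, 8);
        memcpy(&b0, b, 8);
        memcpy(&b1, (const char *) b + 8, 8);
        return a0 == b0 && a1 == b1;
    }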
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote: Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared buffers... Wrong way round. You need to prove the upside of increasing it further, not the contrary. The primary downside is cache hit ratio and displacing other cache entries... I can't do that because I don't have the hardware. I wasn't suggesting to just set it, but to measure the effects of setting it. But the benefits from going from 16 to 128 are pretty significant, at least on this hardware; I'm curious how much further it can be pushed... what's wrong with trying it out? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-25 10:22:47 -0400, Robert Haas wrote: On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote: That leads me to wonder: Have you measured different, lower, number of buffer mapping locks? 128 locks is, if we'd as we should align them properly, 8KB of memory. Common L1 cache sizes are around 32k... Amit has some results upthread showing 64 being good, but not as good as 128. I haven't verified that myself, but have no reason to doubt it. How about you push the spinlock change and I crosscheck the partition number on a multi socket x86 machine? Seems worthwile to make sure that it doesn't cause problems on x86. I seriously doubt it'll, but ... OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote: On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote: Why stop at 128 mapping locks? Theoretical downsides to having more mapping locks have been mentioned a few times but has this ever been measured? I'm starting to wonder if the # mapping locks should be dependent on some other value, perhaps the # of shared bufffers... Wrong way round. You need to prove the upside of increasing it further, not the contrary. The primary downside is cache hit ratio and displacing other cache entries... I can't do that because I don't have the hardware. One interesting part of this is making sure it doesn't regress older/smaller machines. So at least that side you could check... what's wrong with trying it out? If somebody is willing to do it: nothing. I'd just much rather do the, by now proven, simple change before starting with more complex solutions. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 19, 2014 at 7:21 AM, Amit Kapila amit.kapil...@gmail.com wrote: Specific numbers for both of the configurations for which I have posted data in the previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64

buffers_touched_freelist - count of buffers that backends found touched after popping from the freelist
buffers_backend_clocksweep - count of buffer allocations not satisfied from the freelist

buffers_alloc                 1531023
buffers_backend_clocksweep          0
buffers_touched_freelist            0

I didn't believe these numbers, so I did some testing. I used the same configuration you mention here, scale factor = 800, shared_buffers = 12286 MB, and I also saw buffers_backend_clocksweep = 0. I didn't see buffers_touched_freelist showing up anywhere, so I don't know whether that would have been zero or not. Then I tried reducing the high watermark for the freelist from 2000 buffers to 25 buffers, and buffers_backend_clocksweep was *still* 0. At that point I started to smell a rat. It turns out that, with this test configuration, there's no buffer allocation going on at all. Everything fits in shared_buffers, or it did in my test. I had to reduce shared_buffers down to 10491800kB before I got any significant buffer eviction. At that level, a 100-buffer high watermark wasn't sufficient to prevent the freelist from occasionally going empty. A 2000-buffer high water mark was by and large sufficient, although I was able to see small numbers of buffers being allocated via clocksweep right at the very beginning of the test, I guess before the reclaimer really got cranking. So the watermarks seem to be broadly in the right ballpark, but I think the statistics reporting needs improving. We need an easy way to measure the amount of work that bgreclaimer is actually doing. I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock sweep (i.e. the number of buffers it had to examine in order to reclaim the number counted by #1).
4. The number of buffers past which a backend has advanced the clocksweep (i.e. the number of buffers it had to examine in order to allocate the number of buffers counted by #2).
5. The number of buffers allocated from the freelist which the backend did not use because they'd been touched (what you're calling buffers_touched_freelist).

It's hard to come up with good names for all of these things that are consistent with the somewhat wonky existing names. Here's an attempt:

1. bgreclaim_freelist
2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I think we want to make it more parallel with buffers_alloc, which is the number of buffers allocated, not buffers_backend, the number of buffers *written* by a backend)
3. clocksweep_bgreclaim
4. clocksweep_backend
5. freelist_touched

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
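As a sketch, the five proposed counters could live in a struct like the one below. Nothing like this exists yet; the type name and placement are assumptions, and the field names simply follow the naming attempt above:

    #include <stdint.h>

    typedef struct BgReclaimStats
    {
        uint64_t bgreclaim_freelist;       /* 1. buffers put back on the free list */
        uint64_t buffers_alloc_clocksweep; /* 2. allocations that ran the clocksweep */
        uint64_t clocksweep_bgreclaim;     /* 3. buffers examined by the reclaimer */
        uint64_t clocksweep_backend;       /* 4. buffers examined by backends */
        uint64_t freelist_touched;         /* 5. freelist buffers rejected as touched */
    } BgReclaimStats;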
Re: [HACKERS] Scaling shared buffer eviction
Hi, On 2014-09-23 10:31:24 -0400, Robert Haas wrote: I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock sweep (i.e. the number of buffers it had to examine in order to reclaim the number counted by #1).
4. The number of buffers past which a backend has advanced the clocksweep (i.e. the number of buffers it had to examine in order to allocate the number of buffers counted by #2).
5. The number of buffers allocated from the freelist which the backend did not use because they'd been touched (what you're calling buffers_touched_freelist).

Sounds good. It's hard to come up with good names for all of these things that are consistent with the somewhat wonky existing names. Here's an attempt:

1. bgreclaim_freelist

bgreclaim_alloc_clocksweep?

2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I think we want to make it more parallel with buffers_alloc, which is the number of buffers allocated, not buffers_backend, the number of buffers *written* by a backend)
3. clocksweep_bgreclaim
4. clocksweep_backend

I think bgreclaim/backend should always be either a prefix or a postfix - not a prefix in some variables and a postfix in others.

5. freelist_touched

I wonder if we shouldn't move all this to a new view, instead of stuffing it somewhere it really doesn't belong - pg_stat_buffers or something like it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote: [ review ] Oh, by the way, I noticed that this patch breaks pg_buffercache. If we're going to have 128 lock partitions, we need to bump MAX_SIMUL_LWLOCKS. But this gets at another point: the way we're benchmarking this right now, we're really conflating the effects of three different things: 1. Changing the locking regimen around the freelist and clocksweep. 2. Adding a bgreclaimer process. 3. Raising the number of buffer locking partitions. I think it's pretty clear that #1 and #2 are a good idea. #3 is a mixed bag, and it might account for the regressions you saw on some test cases. Increasing the number of buffer mapping locks means that those locks take up more cache lines, which could slow things down in cases where there's no reduction in contention. It also means that the chances of an allocated buffer ending up in the same buffer mapping lock partition are 1/128 instead of 1/16, which means about 5.4 additional lwlock acquire/release cycles per 100 allocations. That's not a ton, but it's not free either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote: But this gets at another point: the way we're benchmarking this right now, we're really conflating the effects of three different things: 1. Changing the locking regimen around the freelist and clocksweep. 2. Adding a bgreclaimer process. 3. Raising the number of buffer locking partitions. I did some more experimentation on this. Attached is a patch that JUST does #1, and, as previously suggested, it uses a single spinlock instead of using two of them that are probably in the same cacheline. Without this patch, on a 32-client, read-only pgbench at scale factor 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about LWLock-inspired preemption:

- LWLockAcquire
   + 68.41% ReadBuffer_common
   + 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the only lwlocks that are causing context-switches are the individual buffer locks. But are we suffering spinlock contention instead as a result? Yes, but not much. s_lock is at 0.41% in the corresponding profile, and only 6.83% of those calls are from the patched StrategyGetBuffer. In a similar profile of master, s_lock is at 0.31%. So there's a bit of additional s_lock contention, but I think it's considerably less than the contention over the lwlock it's replacing, because the spinlock is only held for the minimal amount of time needed, whereas the lwlock could be held across taking and releasing many individual buffer locks. TPS results are a little higher with the patch - these are alternating 5-minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections establishing)
reduce-replacement-locking tps = 177797.842410 (including connections establishing)
reduce-replacement-locking tps = 177894.822656 (including connections establishing)

The picture is similar at 64 clients, but the benefit is a little more:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections establishing)
reduce-replacement-locking tps = 188598.728062 (including connections establishing)
reduce-replacement-locking tps = 187340.977277 (including connections establishing)

What's interesting is that I can't see in the perf output any real sign that the buffer mapping locks are slowing things down, but they clearly are, because when I take this patch and also boost NUM_BUFFER_PARTITIONS to 128, the performance goes up:

reduce-replacement-locking-128 tps = 251001.812843 (including connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including connections establishing)

The performance also goes up if I do that on master, but the effect is substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)

I think this shows pretty clearly that, even without the bgreclaimer, there's a lot of merit in getting rid of BufFreelistLock and using a spinlock held for the absolutely minimal number of
instructions instead. There's already some benefit without doing anything about the buffer mapping locks, and we'll get a lot more benefit once that issue is addressed. I think we need to do some serious study to see whether bgreclaimer is even necessary, because it looks like this change alone, which is much simpler, takes a huge bite out of the scalability problem. I welcome further testing and comments, but my current inclination is to go ahead and push the attached patch. To my knowledge, nobody has at any point objected to this aspect of what's being proposed, and it looks safe to me and seems to be a clear win. We can then deal with the other issues on their own merits. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

commit df4077cda2eae3eb4a5cf387da0c1e7616e73204
Author: Robert Haas rh...@postgresql.org
Date:   Mon Sep 22 16:42:14 2014 -0400

    Remove volatile qualifiers from lwlock.c.

    Now that spinlocks (hopefully!) act as compiler barriers, as of commit
    0709b7ee72e4bc71ad07b7120acd117265ab51d0, this should be safe. This
    serves as a demonstration of the new coding style, and may be
    optimized better on some machines as well.

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..66fb2e4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote: I did some more experimentation on this. Attached is a patch that JUST does #1, and, ... ...and that was the wrong patch. Thanks to Heikki for pointing that out. Second try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..a4ebbcc 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,12 @@ bits of the tag's hash value. The rules stated above apply to each partition
 independently. If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.

-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
 exclusion for operations that access the buffer free list or select
-buffers for replacement. This is always taken in exclusive mode since
-there are no read-only operations on those data structures. The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.) It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement. A spinlock is used here rather than a lightweight
+lock for efficiency; no other locks of any sort should be acquired while
+buffer_strategy_lock is held. This is essential to allow buffer replacement
+to happen in multiple backends with reasonable concurrency.

 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header. This allows operations such as

@@ -165,7 +163,7 @@ consider their pages unlikely to be needed soon; however, the current
 algorithm never does that. The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the buffer_strategy_lock, not the buffer-header
 spinlocks.) To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
 need to take system-wide locks during common operations. It works like

@@ -178,25 +176,26 @@ buffer reference count, so it's nearly free.)
 The clock hand is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers. nextVictimBuffer is protected by the
-BufFreelistLock.
+buffer_strategy_lock.

 The algorithm for a process that needs to obtain a victim buffer is:

-1. Obtain BufFreelistLock.
+1. Obtain buffer_strategy_lock.

-2. If buffer free list is nonempty, remove its head buffer. If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2. Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer. Release
+buffer_strategy_lock. If the buffer is pinned or has a nonzero usage count,
+it cannot be used; ignore it go back to step 1. Otherwise, pin the buffer,
+and return it.

-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
-circularly advance nextVictimBuffer for next time.
+3. Otherwise, the buffer free list is empty. Select the buffer pointed to by
+nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
+Release buffer_strategy_lock.

 4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero) and return to step 3 to
-examine the next buffer.
+be used. Decrement its usage count (if nonzero), reacquire
+buffer_strategy_lock, and return to step 3 to examine the next buffer.

-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+5. Pin the selected buffer, and return.

 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +258,7 @@ dirty and not pinned nor marked with a positive usage count. It pins,
 writes, and releases any such buffer.

 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take buffer_strategy_lock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit. Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index
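A self-contained toy model of steps 1-5 from the README hunk above - simplified names and no pin/usage-count bookkeeping; the real code lives in src/backend/storage/buffer/freelist.c - shows the essential pattern: the spinlock guards only the pointer manipulation, and the victim's state is examined after it is released:

    #include <stdatomic.h>

    #define NBUFFERS 1024

    typedef struct
    {
        atomic_flag strategy_lock;    /* models buffer_strategy_lock */
        int         firstFreeBuffer;  /* free list head, -1 when empty */
        int         nextVictimBuffer; /* the clock hand */
    } StrategyControl;

    static StrategyControl sc = { .strategy_lock = ATOMIC_FLAG_INIT,
                                  .firstFreeBuffer = -1 };

    static void
    spin_acquire(atomic_flag *l)
    {
        while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            ;                         /* spin; the real s_lock backs off */
    }

    static void
    spin_release(atomic_flag *l)
    {
        atomic_flag_clear_explicit(l, memory_order_release);
    }

    /* Advance the clock hand one tick under the spinlock; the caller then
     * checks the victim's pin/usage count *without* holding the lock,
     * reacquiring it only if it must tick again (README steps 3-5). */
    static int
    clock_tick(StrategyControl *c)
    {
        int victim;

        spin_acquire(&c->strategy_lock);
        victim = c->nextVictimBuffer;
        if (++c->nextVictimBuffer >= NBUFFERS)
            c->nextVictimBuffer = 0;
        spin_release(&c->strategy_lock);
        return victim;
    }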
Re: [HACKERS] Scaling shared buffer eviction
Robert Haas robertmh...@gmail.com writes: On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote: I did some more experimentation on this. Attached is a patch that JUST does #1, and, ... ...and that was the wrong patch. Thanks to Heikki for point that out. Second try. But the results you gave in the previous message were correctly attributed? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote: I did some more experimentation on this. Attached is a patch that JUST does #1, and, ... ...and that was the wrong patch. Thanks to Heikki for point that out. Second try. But the results you gave in the previous message were correctly attributed? The patch I attached the first time was just the last commit in the git repository where I wrote the patch, rather than the changes that I made on top of that commit. So, yes, the results from the previous message are with the patch attached to the follow-up. I just typed the wrong git command when attempting to extract that patch to attach it to the email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On 9/23/14, 10:31 AM, Robert Haas wrote: I suggest we count these things: 1. The number of buffers the reclaimer has put back on the free list. 2. The number of times a backend has run the clocksweep. 3. The number of buffers past which the reclaimer has advanced the clock sweep (i.e. the number of buffers it had to examine in order to reclaim the number counted by #1). 4. The number of buffers past which a backend has advanced the clocksweep (i.e. the number of buffers it had to examine in order to allocate the number of buffers count by #3). 5. The number of buffers allocated from the freelist which the backend did not use because they'd been touched (what you're calling buffers_touched_freelist). All sound reasonable. To avoid wasting time here, I think it's only worth doing all of these as DEBUG level messages for now. Then only go through the overhead of exposing the ones that actually seem relevant. That's what I did for the 8.3 work, and most of that data at this level was barely relevant to anyone but me then or since. We don't want the system views to include so much irrelevant trivia that finding the important parts becomes overwhelming. I'd like to see that level of instrumentation--just the debug level messages--used to quantify the benchmarks that people are running already, to prove they are testing what they think they are. That would have caught the test error you already stumbled on for example. Simple paranoia says there may be more issues like that hidden in here somewhere, and this set you've identified should find any more of them around. If all that matches up so the numbers for the new counters seem sane, I think that's enough to commit the first basic improvement here. Then make a second pass over exposing the useful internals that let everyone quantify either the improvement or things that may need to be tunable. -- Greg Smith greg.sm...@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
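Tying that to the counters proposed upthread: a minimal sketch of such DEBUG-level reporting, assuming the BgReclaimStats struct sketched earlier and PostgreSQL's elog/UINT64_FORMAT machinery, could be as simple as:

    /* Emit once per reclaim cycle so benchmark runs can be sanity-checked
     * from the server log at DEBUG1, before any system view is built. */
    static void
    report_reclaim_stats(const BgReclaimStats *s)
    {
        elog(DEBUG1,
             "reclaim: freelist=" UINT64_FORMAT
             " alloc_clocksweep=" UINT64_FORMAT
             " swept_bgreclaim=" UINT64_FORMAT
             " swept_backend=" UINT64_FORMAT
             " freelist_touched=" UINT64_FORMAT,
             s->bgreclaim_freelist, s->buffers_alloc_clocksweep,
             s->clocksweep_bgreclaim, s->clocksweep_backend,
             s->freelist_touched);
    }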
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-23 16:29:16 -0400, Robert Haas wrote: On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote: But this gets at another point: the way we're benchmarking this right now, we're really conflating the effects of three different things: 1. Changing the locking regimen around the freelist and clocksweep. 2. Adding a bgreclaimer process. 3. Raising the number of buffer locking partitions. I did some more experimentation on this. Attached is a patch that JUST does #1, and, as previously suggested, it uses a single spinlock instead of using two of them that are probably in the same cacheline. Without this patch, on a 32-client, read-only pgbench at scale factor 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about LWLock-inspired preemption:

- LWLockAcquire
   + 68.41% ReadBuffer_common
   + 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the only lwlocks that are causing context-switches are the individual buffer locks. But are we suffering spinlock contention instead as a result? Yes, but not much. s_lock is at 0.41% in the corresponding profile, and only 6.83% of those calls are from the patched StrategyGetBuffer. In a similar profile of master, s_lock is at 0.31%. So there's a bit of additional s_lock contention, but I think it's considerably less than the contention over the lwlock it's replacing, because the spinlock is only held for the minimal amount of time needed, whereas the lwlock could be held across taking and releasing many individual buffer locks. Am I understanding you correctly that you also measured context switches for spinlocks? If so, I don't think that's a valid comparison. LWLocks explicitly yield the CPU as soon as there's any contention, while spinlocks will, well, spin. Sure, they also go to sleep, but it'll take longer. It's also worthwhile to remember in such comparisons that lots of the cost of spinlocks will be in the calling routines, not s_lock() - the first TAS() is inlined into the caller. And that's the one that'll incur cache misses and such... Note that I'm explicitly *not* doubting the use of a spinlock itself. Given the short acquisition times and the exclusive use of exclusive acquisition, a spinlock makes more sense. The lwlock's spinlock alone will have about as much contention. I think it might be possible to construct some cases where the spinlock performs worse than the lwlock. But I think those will be clearly in the minority. And at least some of those will be fixed by bgwriter. As an example, consider a single-backend COPY ... FROM of large files with a relatively large s_b. That's a seriously bad workload for the current victim buffer selection, because frequently most of the buffer pool needs to be searched for a buffer with usagecount 0. And this patch will double the amount of atomic ops during that. Let me try to quantify that. What's interesting is that I can't see in the perf output any real sign that the buffer mapping locks are slowing things down, but they clearly are, because when I take this patch and also boost NUM_BUFFER_PARTITIONS to 128, the performance goes up: What events did you profile? I welcome further testing and comments, but my current inclination is to go ahead and push the attached patch. To my knowledge, nobody has at any point objected to this aspect of what's being proposed, and it looks safe to me and seems to be a clear win. We can then deal with the other issues on their own merits.
I took a look at this, and all the stuff I saw that I disliked was there before this patch. So +1 for going ahead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 6:02 PM, Gregory Smith gregsmithpg...@gmail.com wrote: On 9/23/14, 10:31 AM, Robert Haas wrote: I suggest we count these things: 1. The number of buffers the reclaimer has put back on the free list. 2. The number of times a backend has run the clocksweep. 3. The number of buffers past which the reclaimer has advanced the clock sweep (i.e. the number of buffers it had to examine in order to reclaim the number counted by #1). 4. The number of buffers past which a backend has advanced the clocksweep (i.e. the number of buffers it had to examine in order to allocate the number of buffers count by #3). 5. The number of buffers allocated from the freelist which the backend did not use because they'd been touched (what you're calling buffers_touched_freelist). All sound reasonable. To avoid wasting time here, I think it's only worth doing all of these as DEBUG level messages for now. Then only go through the overhead of exposing the ones that actually seem relevant. That's what I did for the 8.3 work, and most of that data at this level was barely relevant to anyone but me then or since. We don't want the system views to include so much irrelevant trivia that finding the important parts becomes overwhelming. I think we expose far too little information in our system views. Just to take one example, we expose no useful information about lwlock acquire or release, but a lot of real-world performance problems are caused by lwlock contention. There are of course difficulties in exposing huge numbers of counters, but we're not talking about many here, so I'd lean toward exposing them in the final patch if they seem at all useful. I'd like to see that level of instrumentation--just the debug level messages--used to quantify the benchmarks that people are running already, to prove they are testing what they think they are. That would have caught the test error you already stumbled on for example. Simple paranoia says there may be more issues like that hidden in here somewhere, and this set you've identified should find any more of them around. Right. If all that matches up so the numbers for the new counters seem sane, I think that's enough to commit the first basic improvement here. Then make a second pass over exposing the useful internals that let everyone quantify either the improvement or things that may need to be tunable. Well, I posted a patch a bit ago that I think is the first basic improvement - and none of these counters are relevant to that. It doesn't add a new background process or anything; it just does pretty much the same thing we do now with less-painful locking. There are no water marks to worry about, or tunable thresholds, or anything; and because it's so much simpler, it's far easier to reason about than the full patch, which is why I feel quite confident pressing on toward a commit. Once that is in, I think we should revisit the idea of a bgreclaimer process, and see how much more that improves things - if at all - on top of what that basic patch already does. For that we'll need these counters, and maybe others. But let's make that phase two. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund and...@2ndquadrant.com wrote: Am I understanding you correctly that you also measured context switches for spinlocks? If so, I don't think that's a valid comparison. LWLocks explicitly yield the CPU as soon as there's any contention, while spinlocks will, well, spin. Sure, they also go to sleep, but it'll take longer. No, I measured CPU consumption attributable to s_lock. (I checked context-switches, too.) It's also worthwhile to remember in such comparisons that lots of the cost of spinlocks will be in the calling routines, not s_lock() - the first TAS() is inlined into the caller. And that's the one that'll incur cache misses and such... True. I can check that - I did not. Note that I'm explicitly *not* doubting the use of a spinlock itself. Given the short acquisition times and the exclusive use of exclusive acquisition, a spinlock makes more sense. The lwlock's spinlock alone will have about as much contention. Right. I think it might be possible to construct some cases where the spinlock performs worse than the lwlock. But I think those will be clearly in the minority. And at least some of those will be fixed by bgwriter. As an example, consider a single-backend COPY ... FROM of large files with a relatively large s_b. That's a seriously bad workload for the current victim buffer selection, because frequently most of the buffer pool needs to be searched for a buffer with usagecount 0. And this patch will double the amount of atomic ops during that. It will actually be far worse than that, because we'll acquire and release the spinlock for every buffer over which we advance the clock sweep, instead of just once for the whole thing. That's reason to hope that a smart bgreclaimer process may be a good improvement on top of this. It can do things like advance the clock sweep hand 16 buffers at a time and then sweep them all after-the-fact, reducing contention on the mutex by an order of magnitude, if that turns out to be an important consideration. But I think it's right to view that as something we need to test vs. the baseline established by this patch. What's clear today is that workloads which stress buffer eviction fall to pieces, because the entire buffer-eviction process is essentially single-threaded. One process can't begin evicting a buffer until another has finished doing so. This lets multiple backends do that at the same time. We may find cases where that leads to an unpleasant amount of contention, but since we have several ideas for how to mitigate that as needs be, I think it's OK to go ahead. The testing we're doing on the combined patch is conflating the effects of the new locking regimen with however the bgreclaimer affects things, and it's very clear to me now that we need to make sure those are clearly separate. Let me try to quantify that. Please do. What's interesting is that I can't see in the perf output any real sign that the buffer mapping locks are slowing things down, but they clearly are, because when I take this patch and also boost NUM_BUFFER_PARTITIONS to 128, the performance goes up: What events did you profile? cs. I welcome further testing and comments, but my current inclination is to go ahead and push the attached patch. To my knowledge, nobody has at any point objected to this aspect of what's being proposed, and it looks safe to me and seems to be a clear win. We can then deal with the other issues on their own merits. I took a look at this, and all the stuff I saw that I disliked was there before this patch.
So +1 for going ahead. Cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
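A sketch of the "advance the clock sweep hand 16 buffers at a time" idea mentioned above, building on the toy StrategyControl model shown earlier; the batch size and function are hypothetical, not from any posted patch:

    #define SWEEP_BATCH 16

    /* Claim a whole window of clock positions in one spinlock
     * acquisition, amortizing the lock to 1/SWEEP_BATCH per buffer
     * examined; the caller then sweeps buffers start through
     * start+SWEEP_BATCH-1 (mod NBUFFERS) without holding the lock. */
    static int
    clock_tick_batch(StrategyControl *c)
    {
        int start;

        spin_acquire(&c->strategy_lock);
        start = c->nextVictimBuffer;
        c->nextVictimBuffer = (start + SWEEP_BATCH) % NBUFFERS;
        spin_release(&c->strategy_lock);
        return start;
    }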
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-23 19:21:10 -0400, Robert Haas wrote: On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund and...@2ndquadrant.com wrote: I think it might be possible to construct some cases where the spinlock performs worse than the lwlock. But I think those will be clearly in the minority. And at least some of those will be fixed by bgwriter. Err, this should read bgreclaimer, not bgwriter. As an example, consider a single-backend COPY ... FROM of large files with a relatively large s_b. That's a seriously bad workload for the current victim buffer selection, because frequently most of the buffer pool needs to be searched for a buffer with usagecount 0. And this patch will double the amount of atomic ops during that. It will actually be far worse than that, because we'll acquire and release the spinlock for every buffer over which we advance the clock sweep, instead of just once for the whole thing. I said double, because we already acquire the buffer header's spinlock every tick. That's reason to hope that a smart bgreclaimer process may be a good improvement on top of this. Right. That's what I was trying to say with bgreclaimer above. But I think it's right to view that as something we need to test vs. the baseline established by this patch. Agreed. I think the possible downsides are at the very fringe of already bad cases. That's why I agreed that you should go ahead. Let me try to quantify that. Please do. I've managed to find a ~1.5% performance regression. But the setup was plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into a fillfactor 10 table with the column set to PLAIN storage. With the resulting table size chosen so it's considerably bigger than s_b, but smaller than the dirty writeback limit of the kernel. That's perfectly reasonable. I can think of a couple other cases, but they're all similarly absurd. What's interesting is that I can't see in the perf output any real sign that the buffer mapping locks are slowing things down, but they clearly are, because when I take this patch and also boost NUM_BUFFER_PARTITIONS to 128, the performance goes up: What events did you profile? cs. Ah. My guess is that most of the time will probably actually be spent in the lwlock's spinlock, not the lwlock putting itself to sleep. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 23, 2014 at 7:42 PM, Andres Freund and...@2ndquadrant.com wrote: It will actually be far worse than that, because we'll acquire and release the spinlock for every buffer over which we advance the clock sweep, instead of just once for the whole thing. I said double, because we already acquire the buffer header's spinlock every tick. Oh, good point. Let me try to quantify that. Please do. I've managed to find a ~1.5% performance regression. But the setup was plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into a fillfactor 10 table with the column set to PLAIN storage. With the resulting table size chosen so it's considerably bigger than s_b, but smaller than the dirty writeback limit of the kernel. That's perfectly reasonable. I can think of a couple other cases, but they're all similarly absurd. Well, it's not insane to worry about such things, but if you can only manage 1.5% on such an extreme case, I'm encouraged. This is killing us on OLTP workloads, and fixing that is a lot more important than a couple percent on an extreme case. Ah. My guess is that most of the time will probably actually be spent in the lwlock's spinlock, not the lwlock putting itself to sleep. Ah, OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On 9/23/14, 7:13 PM, Robert Haas wrote: I think we expose far too little information in our system views. Just to take one example, we expose no useful information about lwlock acquire or release, but a lot of real-world performance problems are caused by lwlock contention. I sent over a proposal for what I was calling Performance Events about a year ago. The idea was to provide a place to save data about lock contention, weird checkpoint sync events, that sort of thing. Replacing log parsing to get at log_lock_waits data was my top priority. Once that's there, lwlocks was an obvious next target. Presumably we just needed collection to be low enough overhead, and then we can go down to whatever shorter locks we want; the lower the overhead, the faster the event we can measure. Sometimes the database will never be able to instrument some of its fastest events without blowing away the event itself. We'll still have perf / dtrace / systemtap / etc. for those jobs. But those are not the problems of the average Postgres DBA's typical day. The data people need to solve this sort of thing in production can't always show up in counters. You'll get evidence the problem is there, but you need more details to actually find the culprit. Some info about the type of lock, tables and processes involved, maybe the query that's running, that sort of thing. You can kind of half-ass the job if you make per-table counters for everything, but we really need more, both to serve our users and to compare well against what other databases provide for tools. That's why I was trying to get the infrastructure to capture all that lock detail, without going through the existing logging system first. Actually building Performance Events fell apart on the storage side: figuring out where to put it all without waiting for a log file to hit disk. I wanted in-memory storage so clients don't wait for anything, then a potentially lossy persistence writer. I thought I could get away with a fixed size buffer like pg_stat_statements uses. That was optimistic. Trying to do better got me lost in memory management land without making much progress. I think the work you've now done on dynamic shared memory gives the right shape of infrastructure that I could pull this off now. I even have funding to work on it again, and it's actually the #2 thing I'd like to take on as I get energy for new feature development. (#1 is the simple but time consuming job of adding block write counters, the lack of which is just killing me on some fast growing installs.) I have a lot of unread messages on this list to sort through right now. I know I saw someone try to revive the idea of saving new sorts of performance log data again recently; can't seem to find it again right now. That didn't seem like it went any farther than thinking about the specifications though. The last time I jumped right over that and hit a wall with this one hard part of the implementation instead: low overhead memory management for saving everything. -- Greg Smith greg.sm...@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
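The storage design Greg describes, where writers never wait and data may be lost under pressure, is essentially a lossy ring buffer: a writer claims a slot and gives up rather than block. A minimal sketch in plain C11 (not a PostgreSQL API; all names here are illustrative):

    #include <stdatomic.h>

    #define RING_SLOTS 8192

    typedef struct PerfEvent { long ts; int pid; char detail[96]; } PerfEvent;

    typedef struct PerfEventRing
    {
        _Atomic unsigned head;              /* claims slots; wraps modulo size */
        _Atomic unsigned long dropped;      /* events lost, kept for honest reporting */
        _Atomic int busy[RING_SLOTS];       /* per-slot "writer active" flag */
        PerfEvent events[RING_SLOTS];       /* fixed size; oldest data overwritten */
    } PerfEventRing;

    /* Writers never block: claim a slot, and drop the event if it's contended. */
    static void
    perf_event_record(PerfEventRing *ring, const PerfEvent *ev)
    {
        unsigned slot = atomic_fetch_add(&ring->head, 1) % RING_SLOTS;
        int expected = 0;

        if (!atomic_compare_exchange_strong(&ring->busy[slot], &expected, 1))
        {
            atomic_fetch_add(&ring->dropped, 1);    /* lossy by design */
            return;
        }
        ring->events[slot] = *ev;
        atomic_store(&ring->busy[slot], 0);
    }

A separate persistence writer can then drain the ring at its leisure; the instrumented backend never pays more than one cache-hot store per event.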
Re: [HACKERS] Scaling shared buffer eviction
On Mon, Sep 22, 2014 at 10:43 AM, Gregory Smith gregsmithpg...@gmail.com wrote: On 9/16/14, 8:18 AM, Amit Kapila wrote: I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in free list, so bgreclaimer does some extra work when it is not required This is exactly what I was warning about, as the sort of lesson learned from the last round of such tuning. There are going to be spots where trying to tune the code to be aggressive on the hard cases will work great. But you need to make that dynamic to some degree, such that the code doesn't waste a lot of time sweeping buffers when the demand for them is actually weak. That will make all sorts of cases that look like this slower. To verify whether the above can lead to any kind of regression, I have checked the cases (workload is 0.05 or 0.1 percent larger than shared buffers) where we need a few extra buffers and bgreclaimer might put some additional buffers, and it turns out that in those cases also there is a win, especially at high concurrency, and results of the same are posted upthread ( http://www.postgresql.org/message-id/CAA4eK1LFGcvzMdcD5NZx7B2gCbP1G7vWK7w32EZk=voolud...@mail.gmail.com). We should be able to tell these apart if there's enough instrumentation and solid logic inside of the program itself though. The 8.3 era BGW coped with a lot of these issues using a particular style of moving average with fast reaction time, plus instrumenting the buffer allocation rate as accurately as it could. So before getting into high/low water mark questions, are you comfortable that there's a clear, accurate number that measures the activity level that's important here? Very good question. This was exactly the thing which was missing in my initial versions (about 2 years back when I tried to solve this problem), but based on Robert's and Andres's feedback I realized that we need an accurate number to measure the activity level (in this case it is consumption of buffers from freelist), so I have introduced the logic to calculate the same (it is stored in the new variable numFreeListBuffers in the BufferStrategyControl structure). And have you considered ways it might be averaging over time or have a history that's analyzed? The current logic of bgreclaimer is such that even if it does some extra activity (extra is very much controlled) in one cycle, it will not start another cycle unless backends consume all the buffers that were made available by bgreclaimer in one cycle. I think the algorithm designed for bgreclaimer automatically averages out based on activity. Do you see any cases where it will not do so? The exact fast approach / slow decay weighted moving average approach of the 8.3 BGW, the thing that tried to smooth the erratic data set possible here, was a pretty critical part of getting itself auto-tuning to workload size. It ended up being much more important than the work of setting the arbitrary watermark levels. Agreed, but the logic with which bgwriter works is pretty different and that's why it needs a different kind of logic to handle auto-tuning. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
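For concreteness, the accounting Amit describes might look roughly like the following. This is only a sketch: numFreeListBuffers and freelist_lck come from the discussion, while the other fields and the NextFreeBuffer() helper are hypothetical stand-ins for the real strategy code.

    typedef struct BufferStrategyControl
    {
        slock_t   freelist_lck;        /* protects the fields below */
        int       firstFreeBuffer;     /* head of the freelist, -1 if empty */
        int       numFreeListBuffers;  /* current freelist length */
        uint32    numBufferAllocs;     /* buffers handed out, for stats */
    } BufferStrategyControl;

    /* Pop from the freelist, tracking consumption so bgreclaimer can react. */
    static int
    StrategyPopFreeBuffer(BufferStrategyControl *sc)
    {
        int buf;

        SpinLockAcquire(&sc->freelist_lck);
        buf = sc->firstFreeBuffer;
        if (buf >= 0)
        {
            sc->firstFreeBuffer = NextFreeBuffer(buf);  /* hypothetical helper */
            sc->numFreeListBuffers--;
            sc->numBufferAllocs++;
        }
        SpinLockRelease(&sc->freelist_lck);
        return buf;        /* -1 means the caller must run the clock sweep */
    }

Because consumption is counted at the single point where buffers leave the freelist, the number is exact rather than estimated, which is the property Greg is asking about.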
Re: [HACKERS] Scaling shared buffer eviction
On 9/16/14, 8:18 AM, Amit Kapila wrote: I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in free list, so bgreclaimer does some extra work when it is not required This is exactly what I was warning about, as the sort of lesson learned from the last round of such tuning. There are going to be spots where trying to tune the code to be aggressive on the hard cases will work great. But you need to make that dynamic to some degree, such that the code doesn't waste a lot of time sweeping buffers when the demand for them is actually weak. That will make all sorts of cases that look like this slower. We should be able to tell these apart if there's enough instrumentation and solid logic inside of the program itself though. The 8.3 era BGW coped with a lot of these issues using a particular style of moving average with fast reaction time, plus instrumenting the buffer allocation rate as accurately as it could. So before getting into high/low water mark questions, are you comfortable that there's a clear, accurate number that measures the activity level that's important here? And have you considered ways it might be averaging over time or have a history that's analyzed? The exact fast approach / slow decay weighted moving average approach of the 8.3 BGW, the thing that tried to smooth the erratic data set possible here, was a pretty critical part of getting itself auto-tuning to workload size. It ended up being much more important than the work of setting the arbitrary watermark levels.
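The fast-attack / slow-decay smoothing Greg refers to has roughly this shape (paraphrased from memory of the 8.3-era bgwriter logic, not copied from it): jump up immediately when the allocation rate spikes, decay gradually when it falls, so that brief lulls don't zero out the estimate.

    #define SMOOTHING_SAMPLES 16

    static float smoothed_alloc = 0;

    static float
    update_alloc_estimate(unsigned recent_alloc)
    {
        if ((float) recent_alloc >= smoothed_alloc)
            smoothed_alloc = (float) recent_alloc;        /* fast attack */
        else
            smoothed_alloc +=                             /* slow decay */
                ((float) recent_alloc - smoothed_alloc) / SMOOTHING_SAMPLES;
        return smoothed_alloc;
    }

The asymmetry is the point: undershooting the estimate risks stalling backends, while overshooting merely does a little extra sweeping, so the estimator is deliberately biased upward.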
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 16, 2014 at 10:21 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila amit.kapil...@gmail.com wrote: In most cases performance with the patch is slightly less as compared to HEAD and the difference is generally less than 1% and in a case or 2 close to 2%. I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in free list, so bgreclaimer does some extra work when it is not required and it also leads to the condition you mentioned down (freelist will contain buffers that have already been touched since we added them). Now for case 2 (12166MB), we need more than 2000 additional buffers, but not too many, so it can also have a similar effect. So there are two suboptimal things that can happen and they pull in opposite directions. I think you should instrument the server how often each is happening. #1 is that we can pop a buffer from the freelist and find that it's been touched. That means we wasted the effort of putting it on the freelist in the first place. #2 is that we can want to pop a buffer from the freelist and find it empty and thus be forced to run the clock sweep ourselves. If we're having problem #1, we could improve things by reducing the water marks. If we're having problem #2, we could improve things by increasing the water marks. If we're having both problems, then I dunno. But let's get some numbers on the frequency of these specific things, rather than just overall tps numbers. Specific numbers for both configurations for which I posted data in the previous mail are as follows: Scale Factor - 800 Shared_Buffers - 12286MB (Total db size is 12288MB) Client and Thread Count = 64 buffers_touched_freelist - count of buffers that backends found touched after popping from freelist. buffers_backend_clocksweep - count of buffer allocations not satisfied from freelist buffers_alloc 1531023 buffers_backend_clocksweep 0 buffers_touched_freelist 0 Scale Factor - 800 Shared_Buffers - 12166MB (Total db size is 12288MB) Client and Thread Count = 64 buffers_alloc 1531010 buffers_backend_clocksweep 0 buffers_touched_freelist 0 In both the above cases, I have taken data multiple times to ensure correctness. From the above data, it is evident that in both the above configurations all the requests are satisfied from the initial freelist. Basically the amount of shared buffers configured (12286MB = 1572608 buffers and 12166MB = 1557248 buffers) is sufficient to contain all the workload for the pgbench run. So now the question is why we are seeing a small variation (1%) in data in case all the data fits in shared buffers, and the reason could be that we have added a few extra instructions (due to the increase in StrategyControl structure size, an additional function call, one or two new assignments) in the Buffer Allocation path (the extra instructions also matter only until all the data pages get associated with buffers; after that, control won't even reach StrategyGetBuffer()) or it may be due to variation across different runs with different binaries. 
I went ahead and took data for cases where shared buffers are a tiny bit (0.1% and 0.05%) smaller than the workload (based on buffer allocations done in the above cases). Performance Data --- Scale Factor - 800 Shared_Buffers - 11950MB Client_Count/Patch_Ver 8 16 32 64 128 HEAD 68424 132540 195496 279511 283280 sbe_v9 68565 132709 194631 284351 289333 Scale Factor - 800 Shared_Buffers - 11955MB Client_Count/Patch_Ver 8 16 32 64 128 HEAD 68331 127752 196385 274387 281753 sbe_v9 68922 131314 194452 284292 287221 The above data indicates that performance is better with the patch in almost all cases, and especially at high concurrency (64 and 128 client count). The overall conclusion is that with the patch a. when the data can fit in RAM and not completely in shared buffers, the performance/scalability is quite good even if shared buffers are just a tiny bit less than all the data. b. when shared buffers are sufficient to contain all the data, then there is a slight difference (1%) in performance. d. Let's not do anything, as if a user does such a configuration, he should be educated to configure shared buffers in a better way, and/or the performance hit doesn't seem to be justified to do any further work. At least worth entertaining. Based on further analysis, I think this is the way to go. Attached is the patch for the new stat (buffers_touched_freelist), just in case you want to run the patch with it, and detailed (individual run) performance data. With Regards,
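In outline, the two counters reported above would be bumped at exactly the two failure points Robert enumerated. A sketch only; PopFromFreelist(), BufferIsTouched() and RunClockSweep() are hypothetical stand-ins for the real strategy code:

    static uint64 buffers_touched_freelist;    /* problem #1: wasted bgreclaimer work */
    static uint64 buffers_backend_clocksweep;  /* problem #2: freelist ran dry */

    static BufferDesc *
    StrategyGetBufferSketch(void)
    {
        BufferDesc *buf;

        while ((buf = PopFromFreelist()) != NULL)
        {
            if (BufferIsTouched(buf))       /* refcount or usage_count nonzero */
            {
                buffers_touched_freelist++;
                continue;                   /* discard it and try the next one */
            }
            return buf;                     /* satisfied from freelist */
        }

        buffers_backend_clocksweep++;       /* freelist empty: backend must sweep */
        return RunClockSweep();
    }

With both counters at zero in the runs above, neither failure mode is occurring, which is why the remaining 1% difference has to be attributed to the extra instructions rather than the water marks.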
Re: [HACKERS] Scaling shared buffer eviction
On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: I will post the data with the latest patch separately (where I will focus on new cases discussed between Robert and Andres). Performance Data with latest version of patch. All the data shown below is a median of 3 runs; for each individual run's data, refer to the attached document (perf_read_scalability_data_v9.ods). Performance Data for Read-only test - Configuration and Db Details IBM POWER-7 16 cores, 64 hardware threads RAM = 64GB Database Locale = C checkpoint_segments=256 checkpoint_timeout=15min shared_buffers=8GB scale factor = 3000 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) Duration of each individual run = 5mins All the data is in tps and taken using pgbench read-only load Client_Count/Patch_Ver 8 16 32 64 128 HEAD 58614 107370 140717 104357 65010 sbe_v9 62943 119064 172246 220174 220904 Observations - 1. It scales well, as with previous versions of the patch, but it seems the performance is slightly better in a few cases, maybe because I have removed a statement (if check) or 2 in bgreclaimer (those were done under spinlock) or it could be just run-to-run difference. (1) A read-only pgbench workload that is just a tiny bit larger than shared_buffers, say size of shared_buffers plus 0.01%. Such workloads tend to stress buffer eviction heavily. When the data is just a tiny bit larger than shared buffers, actually there is no problem in scalability even in HEAD, because I think most of the requests will be satisfied from the existing buffer pool. I have taken data for some of the loads where database size is a bit larger than shared buffers and it is as follows: Scale Factor - 800 Shared_Buffers - 12286MB (Total db size is 12288MB) Client_Count/Patch_Ver 1 8 16 32 64 128 HEAD 8406 68712 13 198481 290340 289828 sbe_v9 8504 68546 131926 195789 289959 289021 Scale Factor - 800 Shared_Buffers - 12166MB (Total db size is 12288MB) Client_Count/Patch_Ver 1 8 16 32 64 128 HEAD 8428 68609 128092 196596 292066 293812 sbe_v9 8386 68546 126926 197126 289959 287621 Observations - In most cases performance with the patch is slightly less as compared to HEAD and the difference is generally less than 1% and in a case or 2 close to 2%. I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in free list, so bgreclaimer does some extra work when it is not required and it also leads to the condition you mentioned down (freelist will contain buffers that have already been touched since we added them). Now for case 2 (12166MB), we need more than 2000 additional buffers, but not too many, so it can also have a similar effect. I think we have below options related to this observation a. 
Some further tuning in bgreclaimer, so that instead of putting the buffers up to the high water mark in freelist, it puts just 1/4th or 1/2 of the high water mark and then checks whether the free list still contains less than or equal to the low water mark; if yes it continues, and if not then it can wait (or maybe some other way). b. Instead of waking bgreclaimer when the number of buffers falls below the low water mark, wake it when the number of times backends do the clock sweep crosses a certain threshold c. Give low and high water marks as config knobs, so that in some rare cases users can use them to do tuning. d. Let's not do anything, as if a user does such a configuration, he should be educated to configure shared buffers in a better way, and/or the performance hit doesn't seem to be justified to do any further work. Now if we do either of 'a' or 'b', then I think there is a chance that the gain might not be the same for cases where users can easily get benefit from this patch, and there is a chance that it degrades the performance in some other case. (2) A workload that maximizes the rate of concurrent buffer eviction relative to other tasks. Read-only pgbench is not bad for this, but maybe somebody's got a better idea. I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB) addresses this case. As I sort of mentioned in what I was writing for the bufmgr README, there are, more or less, three ways this can fall down, at least that I can see: (1) if the high water mark is too high, then we'll start finding buffers in the freelist that have already been touched since we added them
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila amit.kapil...@gmail.com wrote: In most cases performance with the patch is slightly less as compared to HEAD and the difference is generally less than 1% and in a case or 2 close to 2%. I think the main reason for the slight difference is that when the size of shared buffers is almost the same as the data size, the number of buffers it needs from clock sweep is very small; as an example, in the first case (when size of shared buffers is 12286MB), it actually needs at most 256 additional buffers (2MB) via clock sweep, whereas bgreclaimer will put 2000 (high water mark) additional buffers (0.5% of shared buffers is greater than 2000) in free list, so bgreclaimer does some extra work when it is not required and it also leads to the condition you mentioned down (freelist will contain buffers that have already been touched since we added them). Now for case 2 (12166MB), we need more than 2000 additional buffers, but not too many, so it can also have a similar effect. So there are two suboptimal things that can happen and they pull in opposite directions. I think you should instrument the server how often each is happening. #1 is that we can pop a buffer from the freelist and find that it's been touched. That means we wasted the effort of putting it on the freelist in the first place. #2 is that we can want to pop a buffer from the freelist and find it empty and thus be forced to run the clock sweep ourselves. If we're having problem #1, we could improve things by reducing the water marks. If we're having problem #2, we could improve things by increasing the water marks. If we're having both problems, then I dunno. But let's get some numbers on the frequency of these specific things, rather than just overall tps numbers. I think we have below options related to this observation a. Some further tuning in bgreclaimer, so that instead of putting the buffers up to the high water mark in freelist, it puts just 1/4th or 1/2 of the high water mark and then checks whether the free list still contains less than or equal to the low water mark; if yes it continues, and if not then it can wait (or maybe some other way). That sounds suspiciously like just reducing the high water mark. b. Instead of waking bgreclaimer when the number of buffers falls below the low water mark, wake it when the number of times backends do the clock sweep crosses a certain threshold That doesn't sound helpful. c. Give low and high water marks as config knobs, so that in some rare cases users can use them to do tuning. Yuck. d. Let's not do anything, as if a user does such a configuration, he should be educated to configure shared buffers in a better way, and/or the performance hit doesn't seem to be justified to do any further work. At least worth entertaining. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: +++ b/src/backend/postmaster/bgreclaimer.c A fair number of comments in that file refer to bgwriter... Will fix. Fixed. @@ -0,0 +1,302 @@ +/*- + * + * bgreclaimer.c + * + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It + * attempts to keep regular backends from having to run clock sweep (which + * they would only do when they don't find a usable shared buffer from + * freelist to read in another page). That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. Agreed, but the same is used in the bgwriter file as well; so if we change here, we might want to change the bgwriter file header as well. I have just changed bgreclaimer for this comment; if the same is required for bgwriter, I can create a separate patch, as that change is not related to this patch, so I thought it is better to keep it separate. I'd phrase it as which they only need to do if they don't find a victim buffer from the freelist victim buffer sounds more like a buffer which it will get from clock sweep; how about next candidate (the same is used in the function header of StrategyGetBuffer())? Fixed. In the best scenario all requests + * for shared buffers will be fulfilled from freelist as the background + * reclaimer process always tries to maintain buffers on freelist. However, + * regular backends are still empowered to run clock sweep to find a usable + * buffer if the bgreclaimer fails to maintain enough buffers on freelist. empowered sounds strange to me. 'still can run the clock sweep'? No harm in changing like what you are suggesting; however, the same is used in the file header of bgwriter.c as well, so I think let's keep this usage as it is because there is no correctness issue here. Not changed anything for this comment. +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. I have kept it based on bgwriter, so not sure if it's good to change. However, if we want consistency in naming, I would like to keep something like: ReclaimShutdownHandler ReclaimQuickDieHandler .. .. Changed function names to make them consistent. + /* + * If an exception is encountered, processing resumes here. + * + * See notes in postgres.c about the design of this coding. + */ + if (sigsetjmp(local_sigjmp_buf, 1) != 0) + { + /* Since not using PG_TRY, must reset error stack by hand */ + error_context_stack = NULL; .. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). I will add LWLockReleaseAll() in exception handling as discussed elsewhere in thread. Done. + + /* Now we can allow interrupts again */ + RESUME_INTERRUPTS(); Other processes sleep for a second here, I think that's a good idea. E.g. that bit: Agreed, will make change as per suggestion. Done. + /* + * Loop forever + */ + for (;;) + { + int rc; + + + /* + * Backend will signal bgreclaimer when the number of buffers in + * freelist falls below the low water mark of freelist. 
+ */ + rc = WaitLatch(&MyProc->procLatch, +WL_LATCH_SET | WL_POSTMASTER_DEATH, +-1); That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Could you please be more specific here? I wasn't sure if any change is required here, so kept the code as it is. +Background Reclaimer's Processing +- .. +Two water mark indicators are used to maintain sufficient number of buffers +on freelist. Low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it. High water mark +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water mark as stated here doesn't seem to explain anything. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out?
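The backend side of the low-water-mark protocol described above could look roughly like this (a sketch; numFreeListBuffers and freelist_lck are from the discussion, while lowWaterMark and bgreclaimerLatch are illustrative field names):

    /* Called after popping a buffer: wake bgreclaimer if the list runs dry. */
    static void
    MaybeWakeBgreclaimer(BufferStrategyControl *sc)
    {
        bool need_wakeup;

        SpinLockAcquire(&sc->freelist_lck);
        need_wakeup = (sc->numFreeListBuffers < sc->lowWaterMark &&
                       sc->bgreclaimerLatch != NULL);
        SpinLockRelease(&sc->freelist_lck);

        if (need_wakeup)
            SetLatch(sc->bgreclaimerLatch);    /* cheap if it's already set */
    }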
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith gregsmithpg...@gmail.com wrote: This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. Thanks. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. It will be really helpful if you can try out those cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 14/09/14 19:00, Amit Kapila wrote: On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith gregsmithpg...@gmail.com wrote: This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. Thanks. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. It will be really helpful if you can try out those cases. And if you want 'em run on the 60 core beast, just let me know the details and I'll do some runs for you. Cheers Mark
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: +++ b/src/backend/postmaster/bgreclaimer.c A fair number of comments in that file refer to bgwriter... Will fix. @@ -0,0 +1,302 @@ +/*- + * + * bgreclaimer.c + * + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It + * attempts to keep regular backends from having to run clock sweep (which + * they would only do when they don't find a usable shared buffer from + * freelist to read in another page). That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. Agreed, but the same is used in the bgwriter file as well; so if we change here, we might want to change the bgwriter file header as well. I'd phrase it as which they only need to do if they don't find a victim buffer from the freelist victim buffer sounds more like a buffer which it will get from clock sweep; how about next candidate (the same is used in the function header of StrategyGetBuffer())? In the best scenario all requests + * for shared buffers will be fulfilled from freelist as the background + * reclaimer process always tries to maintain buffers on freelist. However, + * regular backends are still empowered to run clock sweep to find a usable + * buffer if the bgreclaimer fails to maintain enough buffers on freelist. empowered sounds strange to me. 'still can run the clock sweep'? No harm in changing like what you are suggesting; however, the same is used in the file header of bgwriter.c as well, so I think let's keep this usage as it is because there is no correctness issue here. + * The bgwriter is started by the postmaster as soon as the startup subprocess + * finishes, or as soon as recovery begins if we are doing archive recovery. Why only archive recovery? I guess (only read this far...) it's not just during InArchiveRecovery but also StandbyMode? It will be for both. But I don't see why we shouldn't use it during normal crash recovery. That's also often painfully slow and the reclaimer could help? Less, but still. Yes, it might improve things a bit; however, the main benefit with this patch is under heavy load, which means that it mainly addresses contention, and in case of crash recovery there will not be any contention because there will be no backend processes. Also, I think to enable bgreclaimer during crash recovery, we need to have one new signal, which is not a problem, but it will make more sense to enable it later based on the benefit, if it is there. +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. I have kept it based on bgwriter, so not sure if it's good to change. However, if we want consistency in naming, I would like to keep something like: ReclaimShutdownHandler ReclaimQuickDieHandler .. .. + /* + * If an exception is encountered, processing resumes here. + * + * See notes in postgres.c about the design of this coding. + */ + if (sigsetjmp(local_sigjmp_buf, 1) != 0) + { + /* Since not using PG_TRY, must reset error stack by hand */ + error_context_stack = NULL; .. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). I will add LWLockReleaseAll() in exception handling as discussed elsewhere in thread. 
+ + /* Now we can allow interrupts again */ + RESUME_INTERRUPTS(); Other processes sleep for a second here, I think that's a good idea. E.g. that bit: Agreed, will make change as per suggestion. + /* + * Loop forever + */ + for (;;) + { + int rc; + + + /* + * Backend will signal bgreclaimer when the number of buffers in + * freelist falls below the low water mark of freelist. + */ + rc = WaitLatch(&MyProc->procLatch, +WL_LATCH_SET | WL_POSTMASTER_DEATH, +-1); That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Could you please be more specific here? +Background Reclaimer's Processing +- .. +Two water mark indicators are used to maintain sufficient number of buffers +on freelist. Low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it. High water mark +indicator is used by bgreclaimer to move buffers to freelist.
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote: Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read IO intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... I don't think it will help much. A pinned buffer is pretty likely to be in modified state in the cache of the cpu of the pinning backend. Even taking a look at the refcount will trigger a writeback and demotion to shared state. When time comes to unpin the buffer the cacheline must again be promoted to exclusive state, introducing coherency traffic. Not locking the buffer only saves transferring the cacheline back to the pinning backend, not a huge amount of savings. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-12 12:38:48 +0300, Ants Aasma wrote: On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote: Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read IO intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... I don't think it will help much. A pinned buffer is pretty likely to be in modified state in the cache of the cpu of the pinning backend. Right. Unless you're on a MOESI platform. I'd really like to know why that's not more widely used. Even taking a look at the refcount will trigger a writeback and demotion to shared state. When time comes to unpin the buffer the cacheline must again be promoted to exclusive state, introducing coherency traffic. Not locking the buffer only saves transferring the cacheline back to the pinning backend, not a huge amount of savings. Yes. But: In many, if not most, cases the cacheline will be read a couple times before modifying it via the spinlock. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
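Concretely, the unlocked pre-check being debated would sit in the victim-selection loop something like this (sketch only, with ClockSweepTick() standing in for the hand-advancing code; whether the stale read pays off is exactly the coherency question Ants raises):

    static volatile BufferDesc *
    GetVictimBufferSketch(void)
    {
        for (;;)
        {
            volatile BufferDesc *buf = &BufferDescriptors[ClockSweepTick()];

            /*
             * Peek without the header spinlock.  The value may be stale,
             * but a buffer that looks pinned is almost certainly not a
             * victim, so don't pull its cacheline exclusive just to
             * confirm that.
             */
            if (buf->refcount != 0)
                continue;

            LockBufHdr(buf);                    /* authoritative recheck */
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;                     /* victim found, header still locked */
            UnlockBufHdr(buf);
        }
    }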
Re: [HACKERS] Scaling shared buffer eviction
On 9/11/14, 7:01 AM, Andres Freund wrote: I'm not convinced that these changes can be made without also changing the bgwriter logic. Have you measured whether there are differences in how effective the bgwriter is? Not that it's very effective right now :) The current background writer tuning went out of its way to do nothing when it wasn't clear there was something that always worked. What happened with all of the really clever schemes was that they worked on some workloads, and just trashed others. Most of the gain from the 8.3 rewrite came from looking at well theorized ideas for how to handle things like pre-emptive LRU scanning for writes, and just throwing them out altogether in favor of ignoring the problem. The magic numbers left in or added to the code were tuned to do very little work, intentionally. If anything, since then the pressure to do nothing has gone up in the default install, because now people are very concerned about extra wakeups using power. To find bad cases before, I was running about 30 different test combinations by the end, Heikki was running another set in the EDB lab, I believe there was a lab at NTT running their own set too. What went in was the combination that didn't break any of them badly--not the one that performed best on the good workloads. This looks like it's squashed one of the very fundamental buffer scaling issues though; well done Amit. What I'd like to see is preserving the heart of that while touching as little as possible. When in doubt, do nothing; let the backends suck it up and do the work themselves. I had to take a health break from community development for a while, and I'm hoping to jump back into review again for the rest of the current development cycle. I'll go back to my notes and try to recreate the pathological cases that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync spreading effort I did; get those running again and see how they do on this new approach. I have a decent sized 24 core server that should be good enough for this job. I'll see what I can do. -- Greg Smith greg.sm...@crunchydatasolutions.com Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
Re: [HACKERS] Scaling shared buffer eviction
Hi, On 2014-09-10 12:17:34 +0530, Amit Kapila wrote: include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c new file mode 100644 index 000..3df2337 --- /dev/null +++ b/src/backend/postmaster/bgreclaimer.c A fair number of comments in that file refer to bgwriter... @@ -0,0 +1,302 @@ +/*- + * + * bgreclaimer.c + * + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It + * attempts to keep regular backends from having to run clock sweep (which + * they would only do when they don't find a usable shared buffer from + * freelist to read in another page). That's not really accurate. Freelist pages are often also needed to write new pages, without reading anything in. I'd phrase it as which they only need to do if they don't find a victim buffer from the freelist In the best scenario all requests + * for shared buffers will be fulfilled from freelist as the background + * reclaimer process always tries to maintain buffers on freelist. However, + * regular backends are still empowered to run clock sweep to find a usable + * buffer if the bgreclaimer fails to maintain enough buffers on freelist. empowered sounds strange to me. 'still can run the clock sweep'? + * The bgwriter is started by the postmaster as soon as the startup subprocess + * finishes, or as soon as recovery begins if we are doing archive recovery. Why only archive recovery? I guess (only read this far...) it's not just during InArchiveRecovery but also StandbyMode? But I don't see why we shouldn't use it during normal crash recovery. That's also often painfully slow and the reclaimer could help? Less, but still. +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. + /* + * If an exception is encountered, processing resumes here. + * + * See notes in postgres.c about the design of this coding. + */ + if (sigsetjmp(local_sigjmp_buf, 1) != 0) + { + /* Since not using PG_TRY, must reset error stack by hand */ + error_context_stack = NULL; + + /* Prevent interrupts while cleaning up */ + HOLD_INTERRUPTS(); + + /* Report the error to the server log */ + EmitErrorReport(); + + /* + * These operations are really just a minimal subset of + * AbortTransaction(). We don't have very many resources to worry + * about in bgreclaim, but we do have buffers and file descriptors. + */ + UnlockBuffers(); + AtEOXact_Buffers(false); + AtEOXact_Files(); No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). + /* + * Now return to normal top-level context and clear ErrorContext for + * next time. + */ + MemoryContextSwitchTo(bgreclaim_context); + FlushErrorState(); + + /* Flush any leaked data in the top-level context */ + MemoryContextResetAndDeleteChildren(bgreclaim_context); + + /* Now we can allow interrupts again */ + RESUME_INTERRUPTS(); Other processes sleep for a second here, I think that's a good idea. E.g. that bit: /* * Sleep at least 1 second after any error. A write error is likely * to be repeated, and we don't want to be filling the error logs as * fast as we can. */ pg_usleep(1000000L); + /* + * Loop forever + */ + for (;;) + { + int rc; + + + /* + * Backend will signal bgreclaimer when the number of buffers in + * freelist falls below the low water mark of freelist. 
+ */ + rc = WaitLatch(&MyProc->procLatch, +WL_LATCH_SET | WL_POSTMASTER_DEATH, +-1); That's probably not going to work well directly after a (re)start of bgreclaim (depending on how you handle the water mark, I'll see in a bit). Maybe it should rather be ResetLatch(); BgMoveBuffersToFreelist(); pgstat_send_bgwriter(); rc = WaitLatch(); if (rc & WL_POSTMASTER_DEATH) exit(1); +Background Reclaimer's Processing +- + +The background reclaimer is designed to move buffers to freelist that are +likely to be recycled soon, thereby offloading the need to perform +clock sweep work from active backends.
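Expanded into a compilable shape, the loop Andres sketches would look roughly like this (a sketch against PostgreSQL's latch API, not the patch's actual code). The point of resetting the latch before doing the work is that a SetLatch() arriving mid-cycle leaves the latch set, so the following WaitLatch() returns immediately instead of sleeping through the request:

    for (;;)
    {
        int rc;

        /* Clear the latch first: any SetLatch() after this point is kept. */
        ResetLatch(&MyProc->procLatch);

        BgMoveBuffersToFreelist();   /* refill freelist to the high water mark */
        pgstat_send_bgwriter();      /* push activity counters to the collector */

        /* Sleep until a backend pokes us; no timeout needed. */
        rc = WaitLatch(&MyProc->procLatch,
                       WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
        if (rc & WL_POSTMASTER_DEATH)
            exit(1);
    }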
Re: [HACKERS] Scaling shared buffer eviction
Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. +Two water mark indicators are used to maintain sufficient number of buffers +on freelist. Low water mark indicator is used by backends to wake bgreclaimer +when the number of buffers in freelist falls below it. High water mark +indicator is used by bgreclaimer to move buffers to freelist. For me the description of the high water mark as stated here doesn't seem to explain anything. Yeah, let me try to revise and expand on that a bit: Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. When the number of buffers on the freelist reaches the low water mark, backends attempting to allocate new buffers will set the bgreclaimer's latch, waking it up again. While it's important for the high water mark to be small (for the reasons described above), we also need to ensure adequate separation between the low and high water marks, so that the bgreclaimer isn't constantly being awoken to find just a handful of additional candidate buffers, and we need to ensure that the low watermark is adequate to keep the freelist from becoming completely empty before bgreclaimer has time to wake up and begin filling it again. This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. 
I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked about before, we could reduce contention here by partitioning the freelist, or by using a CAS loop to pop items off of it. But I am not convinced either is necessary.
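For reference, "a CAS loop to pop items off of it" would look something like the following (a C11 sketch, not PostgreSQL code). The well-known catch is the A-B-A problem: if an item can be popped and re-pushed between the load and the CAS, a naive head swap can corrupt the list, so real lock-free implementations pair the head pointer with a version counter or use an algorithm designed around that hazard.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct FreeNode { struct FreeNode *next; int buf_id; } FreeNode;

    static FreeNode *
    freelist_pop(_Atomic(FreeNode *) *head)
    {
        FreeNode *old = atomic_load(head);

        /* On failure, the CAS refreshes 'old' with the current head; retry. */
        while (old != NULL &&
               !atomic_compare_exchange_weak(head, &old, old->next))
            ;

        return old;    /* NULL means the caller runs the clock sweep instead */
    }

The attraction over a spinlock is that a contended pop costs one retried CAS rather than a full acquire/release cycle, but as noted above it trades that for correctness subtleties.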
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas robertmh...@gmail.com wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some Signal/Interrupt handling? From my side, I have thought about what to keep for error cleanup based on the working of bgreclaimer. However, there is a chance that I have missed something. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:02:34 -0400, Robert Haas wrote: Thanks for reviewing, Andres. On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote: +static void bgreclaim_quickdie(SIGNAL_ARGS); +static void BgreclaimSigHupHandler(SIGNAL_ARGS); +static void ReqShutdownHandler(SIGNAL_ARGS); +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS); This looks inconsistent. It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a good idea, regardless of it possibly being true today (which I'm not sure about yet). We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. Agreed. There really should be three variants: * full abort including support for transactions * full abort without transactions being used (most background processes) * abort without shared memory interactions But for this patch, I think we should aim to mimic the existing style, as ugly as it is. Agreed. Background Reclaimer's Processing - The background reclaimer runs the clock sweep to identify buffers that are good candidates for eviction and puts them on the freelist. This makes buffer allocation much faster, since removing a buffer from the head of a linked list is much cheaper than linearly scanning the whole buffer pool until a promising candidate is found. It's possible that a buffer we add to the freelist may be accessed or even pinned before it's evicted; if that happens, the backend that would have evicted it will simply disregard it and take the next buffer instead (or run the clock sweep itself, if necessary). However, to make sure that doesn't happen too often, we need to keep the freelist as short as possible, so that there won't be many other buffer accesses between the time a buffer is added to the freelist and the time it's actually evicted. We use two water marks to control the activity of the bgreclaimer process. Each time bgreclaimer is awoken, it will move buffers to the freelist until the length of the free list reaches the high water mark. It will then sleep. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... This section should have a description of how the reclaimer interacts with the bgwriter logic. Do we put dirty buffers on the freelist that are then cleaned by the bgwriter? Which buffers does the bgwriter write out? The bgwriter is cleaning ahead of the strategy point, and the bgreclaimer is advancing the strategy point. That sentence, in some form, should be in the above paragraph. I think we should consider having the bgreclaimer wake the bgwriter if it comes across a dirty buffer, because while the bgwriter only estimates the rate of buffer allocation, bgreclaimer *knows* the rate of allocation, because its own activity is tied to the allocation rate. I think there's the potential for this kind of thing to make the background writer significantly more effective than it is today, but I'm heavily in favor of leaving it for a separate patch. Yes, doing that sounds like a good plan. I'm happy with that being done in a separate patch. 
I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit sceptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Hm. Perhaps we should do a bufHdr->refcount != zero check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU. Even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read IO intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... * Are we sure that the freelist_lck spinlock won't cause pain? Right now there will possibly be dozens of processes busily spinning on it... I think it's an acceptable risk, but we should think about it. As you and I have talked about before, we could reduce contention here by partitioning the freelist, or by using a CAS loop to pop items off of it.
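One bounded shape the "both ways" adjustment could take (purely illustrative; nothing like this was agreed on in the thread): grow the high water mark when backends had to run the clock sweep themselves, and decay it toward a floor on quiet cycles, so it cannot converge to infinity as Robert warns.

    static void
    AdjustHighWaterMark(int *hwm, unsigned tmp_recent_backend_clocksweep)
    {
        const int hwm_floor = 2000;        /* both bounds are made-up numbers */
        const int hwm_ceiling = 100000;

        if (tmp_recent_backend_clocksweep > 0)
        {
            /* Demand outran the reclaimer: grow aggressively, but capped. */
            *hwm = (*hwm * 2 > hwm_ceiling) ? hwm_ceiling : *hwm * 2;
        }
        else
        {
            /* Quiet cycle: decay gently back toward the floor. */
            *hwm = (*hwm - *hwm / 8 < hwm_floor) ? hwm_floor : *hwm - *hwm / 8;
        }
    }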
Re: [HACKERS] Scaling shared buffer eviction
We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some Signal/Interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund and...@2ndquadrant.com wrote: We really need a more centralized way to handle error cleanup in auxiliary processes. The current state of affairs is really pretty helter-skelter. But for this patch, I think we should aim to mimic the existing style, as ugly as it is. I'm not sure whether Amit's got the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum, is probably a good idea. Code related to bgreclaimer logic itself doesn't take any LWLock; do you suspect the same might be required due to some Signal/Interrupt handling? I suspect it might creep in at some point at some unrelated place. Which will only ever break in production scenarios. Say, an lwlock in config file processing. Yeah, I suspected the same and checked that path, but couldn't find any; still, it may be there in some path, as the code has many flows. I seem to recall somebody seeing a version of a patch adding an lwlock there... :). Or a logging hook. Or ... The savings from not doing LWLockReleaseAll() are nonexistent, so ... Okay, I shall add it in the next version of the patch and mention the reasons you quoted in the comments. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: It's exactly the same as what bgwriter.c does. So what? There's no code in common, so I see no reason to have one signal handler using underscores and the next one camelcase names. /me shrugs. It's not always possible to have things be consistent with each other within a file and also with what gets done in other files. I'm not sure we should fault patch authors for choosing a different one than we would have chosen. FWIW, I probably would have done it the way Amit did it. I don't actually care, though. I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. When, and if, we develop a theory about specifically how it falls short then, sure, let's adjust it. But I think it would be a serious error to try to engineer a perfect algorithm here based on the amount of testing that we can reasonably do pre-commit. We have no chance of getting that right, and I'd rather have an algorithm that is simple and imperfect than an algorithm that is complex and still imperfect. No matter what, it's better than what we have now. Hm. Perhaps we should do a bufHdr->refcount != 0 check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. I also think it would be good to get some statistics on how often regular backends are running the clocksweep vs. how often bgreclaimer is satisfying their needs. I think that's necessary. The patch added buf_backend_clocksweep. Maybe we just also need buf_backend_from_freelist? That's just (or should be just) buf_alloc - buf_backend_clocksweep. I think buf_backend_clocksweep should really be called buf_alloc_clocksweep, and should be added (in all relevant places) right next to buf_alloc.
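A sketch of the renaming being suggested, with the clock-sweep counter bumped right next to the allocation counter so that buf_alloc_clocksweep is always a subset of buf_alloc (numBufferAllocs is the existing field in freelist.c; numBufferAllocsClocksweep and came_from_clocksweep are assumed names, not from the patch):

    StrategyControl->numBufferAllocs++;                 /* feeds buf_alloc */
    if (came_from_clocksweep)
        StrategyControl->numBufferAllocsClocksweep++;   /* feeds buf_alloc_clocksweep */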
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move BgWriterStats.m_buf_alloc +=, ... pgstat_send_bgwriter() into that loop. Otherwise it'd possibly end up being continuously busy without being visible. I wonder if we don't want to increase the high watermark when tmp_recent_backend_clocksweep > 0? That doesn't really work unless there's some countervailing force to eventually reduce it again; otherwise, it'd just converge to infinity. And it doesn't really seem necessary at the moment. Right, it obviously needs to go both ways. I'm a bit skeptical about untunable, fixed numbers proving to be accurate for widely varied workloads. Me, too, but I'm *even more* skeptical about making things complicated on the pure theory that a simple solution can't be correct. Fair enough. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch; that just doesn't mean every detail is right ;) Hm. Perhaps we should do a bufHdr->refcount != 0 check without locking here? The atomic op will transfer the cacheline exclusively to the reclaimer's CPU, even though it will very shortly afterwards be touched by the pinning backend. Meh. I'm not in favor of adding more funny games with locking unless we can prove they're necessary for performance. Well, this in theory increases the number of processes touching buffer headers regularly. Currently, if you have one read-IO-intensive backend, there's pretty much only one process touching the cachelines. This will make it two. I don't think it's unreasonable to try to reduce the cacheline pingpong caused by that... It's not unreasonable, but this is a good place to apply Knuth's first law of optimization. There's no proof we need to do this, so let's not until there is. That's true for new (pieces of) software; less so when working with an installed base that you might regress... But whatever. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
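A sketch of that suggestion (BgWriterStats.m_buf_alloc and pgstat_send_bgwriter() are real; the helper name and the loop shape are assumed): publish the stats once per pass of the outer loop, so a continuously busy bgreclaimer stays visible in pg_stat_bgwriter instead of accumulating counters silently.

    do
    {
        /* one pass: move buffers until the high water mark is reached */
        num_to_free = BgMoveOnePass();      /* hypothetical helper */

        BgWriterStats.m_buf_alloc += recent_alloc;
        pgstat_send_bgwriter();
    } while (num_to_free > 0);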
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-09-11 09:48:10 -0400, Robert Haas wrote: On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote: I wonder if we should recheck the number of freelist items before sleeping. As the latch currently is reset before sleeping (IIRC) we might miss being woken up soon. It very well might be that bgreclaim needs to run for more than one cycle in a row to keep up... The outer loop in BgMoveBuffersToFreelist() was added to address precisely this point, which I raised in a previous review. Hm, right. But then let's move BgWriterStats.m_buf_alloc +=, ... pgstat_send_bgwriter() into that loop. Otherwise it'd possibly end up being continuously busy without being visible. Good idea. I'm not blind to the possibility that the current logic is inadequate, but testing proves that it works well enough to produce a massive performance boost over where we are now. But, to be honest, the testing so far was pretty narrow in the kind of workloads that were run, if I cross-read things accurately. Don't get me wrong, I'm *really* happy about having this patch; that just doesn't mean every detail is right ;) Oh, sure. Totally agreed. And, to the extent that we're improving things based on actual testing, I'm A-OK with that. I just don't want to start speculating, or we'll never get this thing off the ground. Some possibly-interesting test cases would be: (1) A read-only pgbench workload that is just a tiny bit larger than shared_buffers, say size of shared_buffers plus 0.01%. Such workloads tend to stress buffer eviction heavily. (2) A workload that maximizes the rate of concurrent buffer eviction relative to other tasks. Read-only pgbench is not bad for this, but maybe somebody's got a better idea. As I sort of mentioned in what I was writing for the bufmgr README, there are, more or less, three ways this can fall down, at least that I can see: (1) if the high water mark is too high, then we'll start finding buffers in the freelist that have already been touched since we added them; (2) if the low water mark is too low, the freelist will run dry; and (3) if the low and high water marks are too close together, the bgreclaimer will be constantly getting woken up and going to sleep again. I can't personally think of a workload that will enable us to get a better handle on those cases than high-concurrency pgbench, but you're known to be ingenious at coming up with destructive workloads, so if you have an idea, by all means fire away. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a cat version bump is required as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on linux due to a spacing issue, which is fixed in the attached patch. I took another read through this patch. Here are some further review comments: 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have both num_to_free and tmp_num_to_free. num_to_free is used to accumulate the total number of buffers that are freed in one cycle of BgMoveBuffersToFreelist(), which is reported for debug info (BGW_DEBUG), and tmp_num_to_free is used as a temporary count of the number of buffers to be freed in one sub-cycle (the inner while loop). I'd get rid of tmp_num_to_free and move the declaration of num_to_free inside the outer loop. I'd also move the definitions of tmp_next_to_clean, tmp_recent_alloc, tmp_recent_backend_clocksweep into the innermost scope in which they are used. Okay, I have moved the tmp_* variables into the innermost scope. 2. Also in that function, I think the innermost bit of logic could be rewritten more compactly, and in such a way as to make it clearer for what set of instructions the buffer header will be locked.

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);
    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
        num_to_free--;

Changed as per suggestion. 3. This comment is now obsolete:

    + /*
    + * If bgwriterLatch is set, we need to waken the bgwriter, but we should
    + * not do so while holding freelist_lck; so set it after releasing the
    + * freelist_lck. This is annoyingly tedious, but it happens at most once
    + * per bgwriter cycle, so the performance hit is minimal.
    + */

We're not actually holding any lock in need of releasing at that point in the code, so this can be shortened to If bgwriterLatch is set, we need to waken the bgwriter. Changed as per suggestion. * Ideally numFreeListBuffers should get called under freelist spinlock, That doesn't make any sense. numFreeListBuffers is a variable, so you can't call it. The value should be *read* under the spinlock, but it is. I think this whole comment can be deleted and replaced with If the number of free buffers has fallen below the low water mark, awaken the bgreclaimer to repopulate it. Changed as per suggestion. 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the same victim buffer that's being handed out - at almost the same time - to a backend running the clock sweep; if it does, they'll fight over the buffer. Two, the *end out parameter actually returns a count, not an endpoint. I think we should have BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the top of the inner loop rather than the bottom, and change StrategySyncStartAndEnd() so that it knows nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to call StrategySyncNextVictimBuffer() so that the logic is centralized in one place, and rename StrategySyncStartAndEnd() to something that better matches its revised purpose. Changed as per suggestion.
I have also updated StrategySyncNextVictimBuffer() such that it increments completePasses on completion of a cycle, as I think it is appropriate to update it even when the clock sweep is done by bgreclaimer. Maybe StrategyGetReclaimInfo(). I have changed it to StrategyGetFreelistAccessInfo(), since most other functions in freelist.c use names that relate to buffers. 5. Have you tested that this new bgwriter statistic is actually working? Because it looks to me like BgMoveBuffersToFreelist is changing BgWriterStats but never calling pgstat_send_bgwriter(), which I'm thinking will mean the counters accumulate forever inside the reclaimer but never get forwarded to the stats collector. pgstat_send_bgwriter() is called in the bgreclaimer loop (the caller of BgMoveBuffersToFreelist); this is similar to what bgwriter does. I have done a few tests with it before sending the previous patch. 6. StrategyInitialize() uses #defines for HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000) for clamping. Let's have constants for all of those (and omit mention of the specific values in the comments). Changed as per suggestion. 7. The line you've added to the definition of view pg_stat_bgwriter doesn't seem to be indented the same way as all the others. Tab vs. space problem? Fixed. Performance Data:
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 05/09/14 23:50, Amit Kapila wrote: On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case. Thanks for doing the test. I think if possible you can take the performance data with a higher scale factor (4000), as it seems your m/c has 1TB of RAM. You might want to use the latest patch I have posted today. Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and an explicit checkpoint call between each one seem to help there). Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was... In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting; the benefit of this patch will be greater at higher client counts when there is a large dataset that can still fit in RAM. Many thanks for doing the performance test for the patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 10/09/14 18:54, Amit Kapila wrote: On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case. Yeah, that's what I was expecting; the benefit of this patch will be greater at higher client counts when there is a large dataset that can still fit in RAM. Many thanks for doing the performance test for the patch. No worries, this is looking like a patch we're going to apply to 9.4 to make the 60 core beast scale a bit better, so thanks very much for your work in this area. If you would like more tests run at higher scales let me know (we have two of these machines in a pre-production state currently, so I can run benchmarks as required)! Regards Mark
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 9, 2014 at 3:11 AM, Thom Brown t...@linux.com wrote: On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a cat version bump is required as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on linux due to a spacing issue, which is fixed in the attached patch. Here's a set of test results against this patch: Many thanks for taking the performance data. This data clearly shows that there is a performance improvement even at a lower configuration; however, the real benefit of the patch can be seen on a higher-core m/c with larger RAM (enough to contain all the data). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On 05/09/14 23:50, Amit Kapila wrote: On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 04/09/14 14:42, Amit Kapila wrote: On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Hi Amit, Results look pretty good. Does it help in the read-write case too? Last time I ran the tpc-b test of pgbench (results of which are posted earlier in this thread), there doesn't seem to be any major gain for that; however, for cases where read is predominant, you might see better gains. I am again planning to take that data in the next few days. FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case. Thanks for doing the test. I think if possible you can take the performance data with a higher scale factor (4000), as it seems your m/c has 1TB of RAM. You might want to use the latest patch I have posted today. Here's some fairly typical data from read-write and read-only runs at scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not seeing much variation between repeated read-write runs with the same config (which is nice - sleep 30 and an explicit checkpoint call between each one seem to help there). Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be better at higher client counts than beta1 was... In terms of the effect of the patch - looks pretty similar to the scale 2000 results for read-write, but read-only is a different and more interesting story - unpatched 9.4 is noticeably impacted in the higher client counts, whereas the patched version scales as well (or even better perhaps) than in the scale 2000 case.

read write (600s)
Clients |  tps  | tps (unpatched)
--------+-------+----------------
      6 |  9395 |  9334
     12 | 16605 | 16525
     24 | 24634 | 24910
     48 | 32170 | 31275
     96 | 35675 | 36533
    192 | 35579 | 31137
    384 | 30528 | 28308

read only (300s)
Clients |  tps   | tps (unpatched)
--------+--------+----------------
      6 |  35743 |  35362
     12 | 111019 | 106579
     24 | 199746 | 160305
     48 | 327026 | 198407
     96 | 379184 | 171863
    192 | 356623 | 152224
    384 | 340878 | 128308

regards Mark
Re: [HACKERS] Scaling shared buffer eviction
On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: One regression failed on linux due to a spacing issue, which is fixed in the attached patch. I just read the latest patch out of curiosity; wouldn't it make more sense to split the patch into two different patches for clarity: one for the reclaimer worker centralized around BgMoveBuffersToFreelist, and a second for the pg_stat portion? Those seem like two different features. Regards, -- Michael
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Client Count/Patch_Ver (tps)      8      16      32      64     128
HEAD                          58614  107370  140717  104357   65010
Patch                         60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached with the mail. I have not repeated the test for all configurations, as there is no major change in design/algorithm which can affect performance. Mark has already taken tpc-b data, which ensures that there is no problem with it; however I will also take it once with the latest version. Well, these numbers are pretty much amazing. Question: It seems there's obviously quite a bit of contention left; do you think there's still a significant amount of time in the clock sweep, or is the primary bottleneck the buffer mapping locks? merlin
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a cat version bump is required as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on linux due to a spacing issue, which is fixed in the attached patch. I took another read through this patch. Here are some further review comments: 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have both num_to_free and tmp_num_to_free. I'd get rid of tmp_num_to_free and move the declaration of num_to_free inside the outer loop. I'd also move the definitions of tmp_next_to_clean, tmp_recent_alloc, tmp_recent_backend_clocksweep into the innermost scope in which they are used. 2. Also in that function, I think the innermost bit of logic could be rewritten more compactly, and in such a way as to make it clearer for what set of instructions the buffer header will be locked.

    LockBufHdr(bufHdr);
    if (bufHdr->refcount == 0)
    {
        if (bufHdr->usage_count > 0)
            bufHdr->usage_count--;
        else
            add_to_freelist = true;
    }
    UnlockBufHdr(bufHdr);
    if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
        num_to_free--;

3. This comment is now obsolete:

    + /*
    + * If bgwriterLatch is set, we need to waken the bgwriter, but we should
    + * not do so while holding freelist_lck; so set it after releasing the
    + * freelist_lck. This is annoyingly tedious, but it happens at most once
    + * per bgwriter cycle, so the performance hit is minimal.
    + */

We're not actually holding any lock in need of releasing at that point in the code, so this can be shortened to If bgwriterLatch is set, we need to waken the bgwriter. * Ideally numFreeListBuffers should get called under freelist spinlock, That doesn't make any sense. numFreeListBuffers is a variable, so you can't call it. The value should be *read* under the spinlock, but it is. I think this whole comment can be deleted and replaced with If the number of free buffers has fallen below the low water mark, awaken the bgreclaimer to repopulate it. 4. StrategySyncStartAndEnd() is kind of a mess. One, it can return the same victim buffer that's being handed out - at almost the same time - to a backend running the clock sweep; if it does, they'll fight over the buffer. Two, the *end out parameter actually returns a count, not an endpoint. I think we should have BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the top of the inner loop rather than the bottom, and change StrategySyncStartAndEnd() so that it knows nothing about victimbuf_lck. Let's also change StrategyGetBuffer() to call StrategySyncNextVictimBuffer() so that the logic is centralized in one place, and rename StrategySyncStartAndEnd() to something that better matches its revised purpose. Maybe StrategyGetReclaimInfo(). 5. Have you tested that this new bgwriter statistic is actually working? Because it looks to me like BgMoveBuffersToFreelist is changing BgWriterStats but never calling pgstat_send_bgwriter(), which I'm thinking will mean the counters accumulate forever inside the reclaimer but never get forwarded to the stats collector. 6. StrategyInitialize() uses #defines for HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000) for clamping. Let's have constants for all of those (and omit mention of the specific values in the comments). 7.
The line you've added to the definition of view pg_stat_bgwriter doesn't seem to be indented the same way as all the others. Tab vs. space problem? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
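For review point 6, a hedged sketch of what named clamping constants might look like (the constant names and the exact formula are assumed, not taken from the patch; Min/Max are the usual c.h macros):

    #define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT   0.5
    #define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT    20
    #define FREELIST_HIGH_WATER_MARK_MIN               5
    #define FREELIST_HIGH_WATER_MARK_MAX               2000

    highWaterMark = Max(FREELIST_HIGH_WATER_MARK_MIN,
                        Min(FREELIST_HIGH_WATER_MARK_MAX,
                            (int) (NBuffers *
                                   HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT / 100)));
    lowWaterMark = highWaterMark *
                   LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT / 100;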
Re: [HACKERS] Scaling shared buffer eviction
On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote: Apart from above, I think for this patch, a cat version bump is required as I have modified the system catalog. However I have not done the same in the patch, as otherwise it will be a bit difficult to take performance data. One regression failed on linux due to a spacing issue, which is fixed in the attached patch. Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
pgbench scale factor = 3000
run duration time = 5min

HEAD
Client Count/No. Of Runs (tps)      8      16      32      64     128
Run-1                           31568   41890   48638   49266   41845
Run-2                           31613   41879   48597   49332   41736
Run-3                           31674   41987   48647   49320   41745

Patch=scalable_buffer_eviction_v7.patch
Client Count/No. Of Runs (tps)      8      16      32      64     128
Run-1                           31880   42943   49359   49901   43779
Run-2                           32150   43078   48934   49828   43769
Run-3                           32323   42894   49481   49600   43529

-- Thom
Re: [HACKERS] Scaling shared buffer eviction
On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:

Client Count/Patch_Ver (tps)      8      16      32      64     128
HEAD                          58614  107370  140717  104357   65010
Patch                         60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached with the mail. I have not repeated the test for all configurations, as there is no major change in design/algorithm which can affect performance. Mark has already taken tpc-b data, which ensures that there is no problem with it; however I will also take it once with the latest version. Well, these numbers are pretty much amazing. Question: It seems there's obviously quite a bit of contention left; do you think there's still a significant amount of time in the clock sweep, or is the primary bottleneck the buffer mapping locks? I think contention around the clock sweep is very minimal, and for buffer mapping locks it has reduced significantly (you can refer upthread to some of the LWLOCK stat data I have posted after this patch), but maybe we can get more out of it by improving hash table concurrency. However, apart from both of the above, the next thing I have seen in profiles was palloc (at least that is what I remember from some profiling I did a few months back during the development of this patch). At that time it seemed a totally different optimisation, so I left it for another patch. Another point is that the m/c on which I am doing the performance tests has 16 cores (64 hardware threads), so I am not sure how much scaling we can expect. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have updated the patch to address the feedback. Main changes are: 1. For populating freelist, have a separate process (bgreclaimer) instead of doing it by bgwriter. 2. Autotune the low and high threshold values for buffers in freelist. I have used the formula as suggested by you upthread. 3. Cleanup of locking regimen as discussed upthread (completely eliminated BufFreelist Lock). 4. Improved comments and general code cleanup. Overall this looks quite promising to me. I had thought to call the new process just bgreclaim rather than bgreclaimer, but perhaps your name is better after all. At least, it matches what we do elsewhere. But I don't care for the use of Bgreclaimer; let's do BgReclaimer if we really need mixed-case, or else bgreclaimer. Changed it to bgreclaimer. This is unclear: +buffers for replacement. Earlier to protect freelist, we use LWLOCK as that +is needed to perform clock sweep which is a longer operation, however now we +are using two spinklocks freelist_lck and victimbuf_lck to perform operations +on freelist and run clock sweep respectively. I would drop the discussion of what was done before and say something like this: The data structures relating to buffer eviction are protected by two spinlocks. freelist_lck protects the freelist and related data structures, while victimbuf_lck protects information related to the current clock sweep condition. Changed, but I have not used the exact wording mentioned above; let me know if the new wording is okay. +always in this list. We also throw buffers into this list if we consider +their pages unlikely to be needed soon; this is done by background process +reclaimer. The list is singly-linked using fields in the I suggest: Allocating pages from this list is much cheaper than running the clock sweep algorithm, which may encounter many buffers that are poor candidates for eviction before finding a good candidate. Therefore, we have a background process called bgreclaimer which works to keep this list populated. Changed as per your suggestion. +Background Reclaimer's Processing +- I suggest titling this section Background Reclaim. +The background reclaimer is designed to move buffers to freelist that are I suggest replacing the first three words of this sentence with bgreclaimer. As per discussion in the thread, I have kept it as it is. +and move the the unpinned and zero usage count buffers to freelist. It +keep on doing this until the number of buffers in freelist become equal +high threshold of freelist. s/keep/keeps/ s/become equal/reaches the/ s/high threshold/high water mark/ s/of freelist// Changed as per your suggestion. Please change the other places that say threshold to use the water mark terminology. +if (StrategyMoveBufferToFreeListEnd (bufHdr)) Extra space. + * buffers in consecutive cycles. s/consecutive/later/ +/* Execute the LRU scan */ s/LRU scan/clock sweep/ ? Changed as per your suggestion. +while (tmp_num_to_free > 0) I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. Shouldn't we loop and do the whole thing over once we reach the high watermark, only stopping when StrategySyncStartAndEnd() says num_to_free is 0? Okay, changed the loop as per your suggestion. +/* choose next victim buffer to clean. */ This process doesn't clean buffers; it puts them on the freelist. Right. Changed it to match what it does.
+ * high threshold of freelsit), we drasticaly reduce the odds for Two typos. Fixed. + * of buffers in freelist fall below low threshold of freelist. s/fall/falls/ Changed as per your suggestion. In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? If we put them further apart, e.g. by moving the freelist_lck to the start of the structure, followed by the latches, and leaving victimbuf_lck where it is, does that help performance? As per discussion, I have kept them as they are and added a comment indicating that we can consider having both locks in separate cache lines. +/* + * If the buffer is pinned or has a nonzero usage_count, we cannot use + * it; discard it and retry. (This can only happen if VACUUM put a + * valid buffer in the freelist and then someone else used it before + * we got to it. It's probably impossible altogether as of 8.3, but + * we'd better check anyway.) + */ + This comment is clearly obsolete. Removed. I have not yet added statistics
Re: [HACKERS] Scaling shared buffer eviction
On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 04/09/14 14:42, Amit Kapila wrote: On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Hi Amit, Results look pretty good. Does it help in the read-write case too? Last time I ran the tpc-b test of pgbench (results of which are posted earlier in this thread), there doesn't seem to be any major gain for that; however, for cases where read is predominant, you might see better gains. I am again planning to take that data in the next few days. FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case. Thanks for doing the test. I think if possible you can take the performance data with a higher scale factor (4000), as it seems your m/c has 1TB of RAM. You might want to use the latest patch I have posted today. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 8:03 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com wrote: +while (tmp_num_to_free > 0) I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. It is based on what bgwriter does for num_to_scan, and calling it once has the advantage that we need to take freelist_lck just once. Right, we shouldn't call it every loop iteration. However, consider this scenario: there are no remaining buffers on the list and the high watermark is 2000. We add 2000 buffers to the list. But by the time we get done, other backends have already done 500 more allocations, so now there are only 1500 buffers on the list. If this should occur, we should add an additional 500 buffers to the list before we consider sleeping. We want bgreclaimer to be able to run continuously if the demand for buffers is high enough. It's not difficult to handle such cases, but it can have a downside for the cases where demand from backends is not high. Consider that in the above case, if instead of 500 more allocations it just does 5 more allocations, then bgreclaimer will again have to go through the list and move 5 buffers, and the same can happen again by the time it moves those 5 buffers. Another point to keep in mind here is that in this loop we are reducing the usage_count of buffers as well, in case we don't find a buffer with usage_count = 0. OTOH if we let bgreclaimer go to sleep after it moves the initially identified buffers, then the backend which first finds that the buffers in the freelist have fallen below the low water mark can wake bgreclaimer. In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? I have kept them separate so that backends searching for a buffer in the freelist don't contend with bgreclaimer (while doing the clock sweep) or the clock sweep being done by other backends. I think it will be a bit tricky to devise a test where this can hurt; however, it doesn't seem too bad to have two separate locks in this case. It's not. But if they are in the same cache line, they will behave almost like one lock, because the CPU will lock the entire cache line for each atomic op. See Tom's comments upthread. I think to avoid having them in the same cache line, we might need to add some padding (at least 72 bytes), as the structure size including both the spinlocks is 56 bytes on a PPC64 m/c and the cache line size is 128 bytes. I have taken performance data as well by keeping them further apart as suggested by you upthread and by introducing padding, but the difference in performance is less than 1.5% (at 64 and 128 client counts), which also might be due to variation of data across runs. So now to proceed we have the below options: a. use two spinlocks as in the patch, but keep them as far apart as possible. This might not have an advantage compared to what is used currently in the patch, but in future we can add padding to take advantage if possible (currently on PPC64 it doesn't show any noticeable advantage; however, on some other m/c it might show an advantage). b. use only one spinlock; this can have a disadvantage in certain cases as mentioned upthread; however, those might not be usual cases, so for now we can consider them as lower priority and choose this option.
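A minimal sketch of option (a) with explicit padding (the field layout is assumed, not the actual BufferStrategyControl definition; the 128-byte figure matches the PPC64 machine described above):

    #define ASSUMED_CACHE_LINE_SIZE 128     /* PPC64 in these tests */

    typedef struct
    {
        slock_t     freelist_lck;           /* guards freelist fields */
        char        pad[ASSUMED_CACHE_LINE_SIZE - sizeof(slock_t)];
        slock_t     victimbuf_lck;          /* guards clock-sweep position */
        /* ... remaining strategy fields ... */
    } BufferStrategyControl;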
Another point in this regard is that I have to make use of a volatile pointer to prevent code rearrangement in this case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote: Performance Data: --- Configuration and Db Details IBM POWER-7 16 cores, 64 hardware threads RAM = 64GB Database Locale = C checkpoint_segments = 256 checkpoint_timeout = 15min scale factor = 3000 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) Duration of each individual run = 5mins All the data is in tps and taken using pgbench read-only load Common configuration remains same as above. Shared_Buffers = 500MB

Client Count/Patch_Ver      8      16      32      64     128
HEAD                    56248  100112  121341   81128   56552
Patch                   59389  112483  157034  185740  166725

.. Observations - 1. Performance improvement is up to 2~3 times for higher client counts (64, 128). 2. For a lower client count (8), we can see 2~5% performance improvement. 3. Overall, this improves the read scalability. 4. For a lower number of shared buffers, we see that there is a minor dip in tps even after the patch (it might be that we can improve it by tuning the high water mark for the number of buffers on the freelist; I will try this by varying the high water mark). I have taken performance data by varying the high and low water marks for a lower value of shared buffers, which is as below: Shared_buffers = 500MB Scale_factor = 3000 HM - High water mark, 0.5 means 0.5% of total shared buffers LM - Low water mark, 20 means 20% of HM.

Client Count/Patch_Ver (Data in tps)     128
HM=0.5;LM=20                          166725
HM=1;LM=20                            166556
HM=2;LM=30                            166463
HM=5;LM=30                            166107
HM=10;LM=30                           167231

Observation a. There is hardly any difference by varying the High and Low water marks as compared to the default values currently used in the patch. b. I think this minor dip compared to the 64 client count is because, first, this m/c has 64 hardware threads, due to which scaling beyond 64 clients is difficult, and second, at a relatively low buffer count (500MB) there is still minor contention around the Buf Mapping locks. In general, I think with the patch the scaling is much better (2 times) than HEAD, even when shared buffers are small and the client count is high, so this is not an issue. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
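For clarity, the HM/LM parametrization varied in the table above, expressed as a hedged computation (variable names assumed): HM is a percentage of NBuffers, and LM a percentage of the resulting high water mark.

    int     high_water_mark = (int) (NBuffers * 0.5 / 100);     /* HM = 0.5 */
    int     low_water_mark = high_water_mark * 20 / 100;        /* LM = 20 */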
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila amit.kapil...@gmail.com wrote: It's not difficult to handle such cases, but it can have a downside for the cases where demand from backends is not high. Consider that in the above case, if instead of 500 more allocations it just does 5 more allocations, then bgreclaimer will again have to go through the list and move 5 buffers, and the same can happen again by the time it moves those 5 buffers. That's exactly the scenario in which we *want* the looping behavior. If that's happening, then it means it's taking us exactly as long to find 5 buffers as it takes the rest of the system to use 5 buffers. We need to run continuously to keep up. It's not. But if they are in the same cache line, they will behave almost like one lock, because the CPU will lock the entire cache line for each atomic op. See Tom's comments upthread. I think to avoid having them in the same cache line, we might need to add some padding (at least 72 bytes), as the structure size including both the spinlocks is 56 bytes on a PPC64 m/c and the cache line size is 128 bytes. I have taken performance data as well by keeping them further apart as suggested by you upthread and by introducing padding, but the difference in performance is less than 1.5% (at 64 and 128 client counts), which also might be due to variation of data across runs. So now to proceed we have the below options: a. use two spinlocks as in the patch, but keep them as far apart as possible. This might not have an advantage compared to what is used currently in the patch, but in future we can add padding to take advantage if possible (currently on PPC64 it doesn't show any noticeable advantage; however, on some other m/c it might show an advantage). b. use only one spinlock; this can have a disadvantage in certain cases as mentioned upthread; however, those might not be usual cases, so for now we can consider them as lower priority and choose this option. I guess I don't care that much. I only mentioned it because Tom brought it up; I don't really see a big problem with the way you're doing it. Another point in this regard is that I have to make use of a volatile pointer to prevent code rearrangement in this case. Yep. Or we need to get off our duff and fix it so that's not necessary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
Robert Haas robertmh...@gmail.com wrote: On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila amit.kapil...@gmail.com wrote: It's not difficult to handle such cases, but it can have a downside for the cases where demand from backends is not high. Consider that in the above case, if instead of 500 more allocations it just does 5 more allocations, then bgreclaimer will again have to go through the list and move 5 buffers, and the same can happen again by the time it moves those 5 buffers. That's exactly the scenario in which we *want* the looping behavior. If that's happening, then it means it's taking us exactly as long to find 5 buffers as it takes the rest of the system to use 5 buffers. We need to run continuously to keep up. That's what I was thinking, as long as there isn't a lot of overhead to starting and finishing a cycle. If there is, my inclination would be to try to fix that rather than to sleep and hope things don't get out of hand before it wakes up again. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Scaling shared buffer eviction
Robert Haas wrote: On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com wrote: +Background Reclaimer's Processing +- I suggest titling this section Background Reclaim. I don't mind changing it, but the current title is based on the similar title Background Writer's Processing used in the previous paragraph. Is there a reason to title this differently? Oh, I didn't see that. Seems like weird phrasing to me, but I guess it's probably better to keep it consistent. ... or you can also change the other one. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Scaling shared buffer eviction
On 04/09/14 14:42, Amit Kapila wrote: On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Hi Amit, Results look pretty good. Does it help in the read-write case too? Last time I ran the tpc-b test of pgbench (results of which are posted earlier in this thread), there doesn't seem to be any major gain for that; however, for cases where read is predominant, you might see better gains. I am again planning to take that data in the next few days. FWIW below are some test results on the 60 core beast with this patch applied to 9.4. I'll need to do more runs to iron out the variation, but it looks like the patch helps the standard (write heavy) pgbench workload a little, and clearly helps the read only case.

4x E7-4890, 15 cores each
1 TB RAM
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD
Ubuntu 14.04 (Linux 3.13)
Postgres 9.4 beta2 + buffer eviction patch v5

Pgbench scale 2000. Non default params:
max_connections = 400
shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 256MB
checkpoint_segments = 1920
checkpoint_completion_target = 0.8
ssl = 'off'
wal_sync_method = 'open_datasync'

read write, elapsed 600s
Clients |  tps  | tps (unpatched)
--------+-------+----------------
      6 |  8279 |  8328
     12 | 16260 | 16381
     24 | 23639 | 23451
     48 | 31430 | 31004
     96 | 38516 | 34777
    192 | 33535 | 32443
    384 | 27978 | 25068
    384 | 30589 | 28798

read only, elapsed 300s
Clients |  tps   | tps (unpatched)
--------+--------+----------------
      6 |  57654 |  57255
     12 | 111361 | 112360
     24 | 220304 | 187967
     48 | 384567 | 230961
     96 | 380309 | 241947
    192 | 330865 | 214570
    384 | 315516 | 207548

Regards Mark
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: I have updated the patch to address the feedback. Main changes are: 1. For populating freelist, have a separate process (bgreclaimer) instead of doing it by bgwriter. 2. Autotune the low and high threshold values for buffers in freelist. I have used the formula as suggested by you upthread. 3. Cleanup of locking regimen as discussed upthread (completely eliminated BufFreelist Lock). 4. Improved comments and general code cleanup. +Background Reclaimer's Processing +- I suggest titling this section Background Reclaim. I don't mind changing it, but the current title is based on the similar title Background Writer's Processing used in the previous paragraph. Is there a reason to title this differently? +The background reclaimer is designed to move buffers to freelist that are I suggest replacing the first three words of this sentence with bgreclaimer. Again, what I have used matches the BgWriter's explanation. I thought it would be better if the wording used is similar. +while (tmp_num_to_free > 0) I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. It is based on what bgwriter does for num_to_scan, and calling it once has the advantage that we need to take freelist_lck just once. Shouldn't we loop and do the whole thing over once we reach the high watermark, only stopping when StrategySyncStartAndEnd() says num_to_free is 0? Do you mean to say that for the high water mark check, we should always refer to StrategySyncStartAndEnd() rather than getting the value at the beginning? Are you thinking that somebody else might already have put some buffers onto the freelist, which would change the initially identified high water mark? I think that can happen only during a few rarely used operations. Do you think that for that we should start referring to StrategySyncStartAndEnd() in the loop? In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? I have kept them separate so that backends searching for a buffer in the freelist don't contend with bgreclaimer (while doing the clock sweep) or the clock sweep being done by other backends. I think it will be a bit tricky to devise a test where this can hurt; however, it doesn't seem too bad to have two separate locks in this case. If we put them further apart, e.g. by moving the freelist_lck to the start of the structure, followed by the latches, and leaving victimbuf_lck where it is, does that help performance? I can investigate. +/* + * If the buffer is pinned or has a nonzero usage_count, we cannot use + * it; discard it and retry. (This can only happen if VACUUM put a + * valid buffer in the freelist and then someone else used it before + * we got to it. It's probably impossible altogether as of 8.3, but + * we'd better check anyway.) + */ + This comment is clearly obsolete. Okay, but this patch hasn't changed anything w.r.t. the above comment, so I haven't changed it. Do you want me to remove the second part of the comment, starting with (This can only happen? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Scaling shared buffer eviction
On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com wrote: +Background Reclaimer's Processing +- I suggest titling this section Background Reclaim. I don't mind changing it, but the current title is based on the similar title Background Writer's Processing used in the previous paragraph. Is there a reason to title this differently? Oh, I didn't see that. Seems like weird phrasing to me, but I guess it's probably better to keep it consistent. +The background reclaimer is designed to move buffers to freelist that are I suggest replacing the first three words of this sentence with bgreclaimer. Again, what I have used matches the BgWriter's explanation. I thought it would be better if the wording used is similar. OK. +while (tmp_num_to_free > 0) I am not sure it's a good idea for this value to be fixed at loop start and then just decremented. It is based on what bgwriter does for num_to_scan, and calling it once has the advantage that we need to take freelist_lck just once. Right, we shouldn't call it every loop iteration. However, consider this scenario: there are no remaining buffers on the list and the high watermark is 2000. We add 2000 buffers to the list. But by the time we get done, other backends have already done 500 more allocations, so now there are only 1500 buffers on the list. If this should occur, we should add an additional 500 buffers to the list before we consider sleeping. We want bgreclaimer to be able to run continuously if the demand for buffers is high enough. In freelist.c, it seems like a poor idea to have two spinlocks as consecutive structure members; they'll be in the same cache line, leading to false sharing. If we merge them into a single spinlock, does that hurt performance? I have kept them separate so that backends searching for a buffer in the freelist don't contend with bgreclaimer (while doing the clock sweep) or the clock sweep being done by other backends. I think it will be a bit tricky to devise a test where this can hurt; however, it doesn't seem too bad to have two separate locks in this case. It's not. But if they are in the same cache line, they will behave almost like one lock, because the CPU will lock the entire cache line for each atomic op. See Tom's comments upthread. Okay, but this patch hasn't changed anything w.r.t. the above comment, so I haven't changed it. Do you want me to remove the second part of the comment, starting with (This can only happen? Right. Clearly it can happen again once we have this patch: that's the entire point of the patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
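A hedged sketch of the loop structure described here (helper names and signatures are assumed, not the patch's code): re-read the freelist state after every pass and keep sweeping until the high water mark actually holds, so bgreclaimer can run continuously when demand is high.

    for (;;)
    {
        /* hypothetical helper: buffers still needed to reach the high water mark */
        int     num_to_free = FreelistDeficit();

        if (num_to_free <= 0)
            break;                          /* safe to sleep now */

        while (num_to_free > 0)
        {
            if (MoveOneBufferToFreelist())  /* hypothetical helper */
                num_to_free--;
        }
    }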
Re: [HACKERS] Scaling shared buffer eviction
On 03/09/14 16:22, Amit Kapila wrote: On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com wrote: I have yet to collect data under varying loads; however, I have collected performance data for 8GB shared buffers which shows reasonably good performance and scalability. I think the main part left for this patch is more data for various loads, which I will share in the next few days; however, I think the patch is ready for the next round of review, so I will mark it as Needs Review. I have collected more data with the patch. I understand that you have given more review comments due to which the patch requires changes; however, I think they will not affect the performance data to a great extent, and I have anyway taken the data, so I am sharing it. Performance Data: --- Configuration and Db Details IBM POWER-7 16 cores, 64 hardware threads RAM = 64GB Database Locale = C checkpoint_segments = 256 checkpoint_timeout = 15min scale factor = 3000 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8) Duration of each individual run = 5mins All the data is in tps and taken using pgbench read-only load Common configuration remains same as above. Forgot to mention that the data is a median of 3 runs, and the attached sheet contains data for the individual runs. Hi Amit, Results look pretty good. Does it help in the read-write case too? Cheers Mark
Re: [HACKERS] Scaling shared buffer eviction
On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: Hi Amit, Results look pretty good. Does it help in the read-write case too? Last time I ran the tpc-b test of pgbench (results of which are posted earlier in this thread), there doesn't seem to be any major gain for that; however, for cases where read is predominant, you might see better gains. I am again planning to take that data in the next few days. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com