Re: [HACKERS] Scaling shared buffer eviction

2014-12-09 Thread Amit Kapila
On Tue, Nov 11, 2014 at 3:00 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-11-11 09:29:22 +, Thom Brown wrote:
  On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:
 
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com
   wrote:
   
But this gets at another point: the way we're benchmarking this
right
now, we're really conflating the effects of three different things:
   
1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.
  
   First of all, thanks for committing part-1 of these changes, and it
   seems you are planning to commit part-3 based on the results of tests
   which Andres is planning to do; for the remaining part (part-2), today
  
 
  Were parts 2 and 3 committed in the end?

 3 was committed. 2 wasn't because it's not yet clear how generally
 beneficial it is.


As shown upthread, this patch (as it stands today) depends on another
patch (wait-free LW_SHARED acquisition) which is still not committed, and
more work is needed to clearly show the gain from this patch, so I have
marked it as Returned with Feedback.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-11-11 Thread Thom Brown
On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:

  On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com
 wrote:
 
  But this gets at another point: the way we're benchmarking this right
  now, we're really conflating the effects of three different things:
 
  1. Changing the locking regimen around the freelist and clocksweep.
  2. Adding a bgreclaimer process.
  3. Raising the number of buffer locking partitions.

 First of all, thanks for committing part-1 of these changes, and it
 seems you are planning to commit part-3 based on the results of tests
 which Andres is planning to do; for the remaining part (part-2), today


Were parts 2 and 3 committed in the end?

-- 
Thom


Re: [HACKERS] Scaling shared buffer eviction

2014-11-11 Thread Andres Freund
On 2014-11-11 09:29:22 +, Thom Brown wrote:
 On 26 September 2014 12:40, Amit Kapila amit.kapil...@gmail.com wrote:
 
   On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com
  wrote:
  
   But this gets at another point: the way we're benchmarking this right
   now, we're really conflating the effects of three different things:
  
   1. Changing the locking regimen around the freelist and clocksweep.
   2. Adding a bgreclaimer process.
   3. Raising the number of buffer locking partitions.
 
  First of all, thanks for committing part-1 of these changes, and it
  seems you are planning to commit part-3 based on the results of tests
  which Andres is planning to do; for the remaining part (part-2), today
 
 
 Were parts 2 and 3 committed in the end?

3 was committed. 2 wasn't because it's not yet clear how generally
beneficial it is.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-27 Thread Amit Kapila
On Tue, Oct 14, 2014 at 3:24 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com
wrote:
  
   On another point, I think it would be a good idea to rebase the
   bgreclaimer patch over what I committed, so that we have a
   clean patch against master to test with.
 
  Please find the rebased patch attached with this mail.  I have taken
  some performance data as well and done some analysis based on
  the same.
 
 
 

 To reduce the above contention, I tried to write a patch that replaces the
 spinlock used in dynahash to manage the free list with atomic operations.
 Work is still pending on this patch to ensure that the approach it uses is
 completely sane; however, I am posting the patch so that others can have a
 look at it and give me feedback about the approach.

After further work on this patch (replacement of the spinlock in dynahash),
I found that I need to solve the A-B-A problem for the lockless structure, as
described in section 5.1 of paper [1].  I could have pursued solving it by
using some additional variables, as described in section 5 of that paper, but
it seems to me that Robert has already solved it using another technique
proposed in his patch [2], so it would be a waste of effort for me to pursue
this problem.  So I am not planning to continue with this patch and am
marking it as Rejected in the CF app.

[1]
http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf
[2] https://commitfest.postgresql.org/action/patch_view?id=1613
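
For readers not following the linked paper: the structure in question is a
lock-free LIFO free list updated with compare-and-swap, and the usual fix for
the A-B-A problem is exactly the kind of additional variable mentioned above,
a version counter packed next to the element index so both change in a single
64-bit CAS.  A minimal illustrative sketch (not Amit's patch and not
PostgreSQL code):

#include <stdint.h>
#include <stdatomic.h>

#define FREELIST_EMPTY 0xFFFFFFFFu

typedef struct FreeListNode
{
    uint32_t    next;           /* index of next free element, or FREELIST_EMPTY */
} FreeListNode;

typedef struct FreeListHead
{
    _Atomic uint64_t head;      /* low 32 bits: element index, high 32 bits: version */
} FreeListHead;

static inline uint64_t
pack_head(uint32_t idx, uint32_t version)
{
    return ((uint64_t) version << 32) | idx;
}

/* Pop one free element; returns FREELIST_EMPTY if the list is empty. */
static uint32_t
freelist_pop(FreeListHead *fl, FreeListNode *nodes)
{
    uint64_t    old = atomic_load(&fl->head);

    for (;;)
    {
        uint32_t    idx = (uint32_t) old;
        uint32_t    ver = (uint32_t) (old >> 32);
        uint64_t    new_head;

        if (idx == FREELIST_EMPTY)
            return FREELIST_EMPTY;

        /*
         * Bump the version: even if the same index reappears at the head
         * (the classic A-B-A scenario), the CAS below fails and we retry.
         */
        new_head = pack_head(nodes[idx].next, ver + 1);

        if (atomic_compare_exchange_weak(&fl->head, &old, new_head))
            return idx;
        /* CAS failed: 'old' was refreshed with the current head, retry. */
    }
}

/* Push an element back onto the free list. */
static void
freelist_push(FreeListHead *fl, FreeListNode *nodes, uint32_t idx)
{
    uint64_t    old = atomic_load(&fl->head);

    for (;;)
    {
        uint64_t    new_head;

        nodes[idx].next = (uint32_t) old;   /* link to the current head */
        new_head = pack_head(idx, (uint32_t) (old >> 32) + 1);

        if (atomic_compare_exchange_weak(&fl->head, &old, new_head))
            return;
    }
}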

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-10-15 Thread Amit Kapila
On Tue, Oct 14, 2014 at 3:32 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
  After that I observed that contention for LW_SHARED has reduced
  for this load, but it didn't help much in terms of performance, so I
  again rechecked the profile.  This time most of the contention has moved
  to the spinlock used in dynahash for the buffer mapping tables; please
  refer to the profile (for 128 client count; read-only load) below:
 
  bgreclaimer patch + wait free lw_shared acquisition patches -
 
--

 This profile is without -O2 again. I really don't think it makes much
 sense to draw much inference from an unoptimized build.

Profile data with -O2 is below.  It shows that the top contributors are
calls to BufTableLookup and the spinlock taken by BufTableInsert and
BufTableDelete.  To resolve the spinlock contention, a patch like the one
above might prove useful (although I still have to evaluate that).  I would
also like to collect LWLOCK_STATS data before proceeding further.
Do you have any other ideas?

   11.17%  swapper  [unknown]           [H] 0x011e0328
+   4.62% postgres  postgres            [.] hash_search_with_hash_value
+   4.35%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
+   3.71% postgres  postgres            [.] s_lock
    2.56% postgres  [unknown]           [H] 0x01500120
+   2.23%  pgbench  [kernel.kallsyms]   [k] .idle_cpu
+   1.97% postgres  postgres            [.] LWLockAttemptLock
+   1.73% postgres  postgres            [.] LWLockRelease
+   1.47% postgres  [kernel.kallsyms]   [k] .__copy_tofrom_user_power7
+   1.44% postgres  postgres            [.] GetSnapshotData
+   1.28% postgres  postgres            [.] _bt_compare
+   1.04%  swapper  [kernel.kallsyms]   [k] .int_sqrt
+   1.04% postgres  postgres            [.] AllocSetAlloc
+   0.97%  pgbench  [kernel.kallsyms]   [k] .default_wake_function

Detailed Data

-   4.62% postgres  postgres            [.] hash_search_with_hash_value
   - hash_search_with_hash_value
      - 2.19% BufTableLookup
         - 2.15% BufTableLookup
              ReadBuffer_common
            - ReadBufferExtended
               - 1.32% _bt_relandgetbuf
      - 0.73% BufTableDelete
         - 0.71% BufTableDelete
              ReadBuffer_common
              ReadBufferExtended
      - 0.69% BufTableInsert
         - 0.68% BufTableInsert
              ReadBuffer_common
              ReadBufferExtended
        0.66% hash_search_with_hash_value
-   4.35%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
   - .find_busiest_group
      - 4.28% .find_busiest_group
         - 4.26% .load_balance
            - 4.26% .idle_balance
               - .__schedule
                  - 4.26% .schedule_hrtimeout_range_clock
                       .do_select
                       .core_sys_select
-   3.71% postgres  postgres            [.] s_lock
   - s_lock
      - 3.19% hash_search_with_hash_value
         - 3.18% hash_search_with_hash_value
            - 1.60% BufTableInsert
                 ReadBuffer_common
               - ReadBufferExtended
            - 1.57% BufTableDelete
                 ReadBuffer_common
               - ReadBufferExtended
                  - 0.93% index_fetch_heap



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-10-14 Thread Amit Kapila
On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila amit.kapil...@gmail.com wrote:

 On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com
wrote:
 
  On another point, I think it would be a good idea to rebase the
  bgreclaimer patch over what I committed, so that we have a
  clean patch against master to test with.

 Please find the rebased patch attached with this mail.  I have taken
 some performance data as well and done some analysis based on
 the same.



 Below data is median of 3 runs.

 patch_ver/client_count      1       8      32      64     128     256
 HEAD                    18884  118628  251093  216294  186625  177505
 PATCH                   18743  122578  247243  205521  179712  175031


 Here we can see that the performance dips at higher client
 count (>=32) which was quite surprising for me, as I was expecting
 it to improve, because bgreclaimer reduces the contention by making
 buffers available on free list.  So I tried to analyze the situation by
 using perf and found that in above configuration, there is a contention
 around freelist spinlock with HEAD and the same is removed by Patch,
 but still the performance goes down with Patch.  On further analysis, I
 observed that actually after Patch there is an increase in contention
 around ProcArrayLock (shared LWlock) via GetSnapshotData which
 sounds a bit odd, but that's what I can see in profiles.  Based on analysis,
 few ideas which I would like to further investigate are:
 a.  As there is an increase in spinlock contention, I would like to check
 with Andres's latest patch which reduces contention around shared
 lwlocks.

I have tried applying Andres's wait-free LW_SHARED acquisition patch,
posted by him at the link below, along with the bgreclaimer patch:
http://www.postgresql.org/message-id/20141008133533.ga5...@alap3.anarazel.de

After that I observed that contention for LW_SHARED has reduced
for this load, but it didn't help much in terms of performance, so I again
rechecked the profile.  This time most of the contention has moved
to the spinlock used in dynahash for the buffer mapping tables; please refer
to the profile (for 128 client count; read-only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches -
--

    9.24%  swapper  [unknown]           [H] 0x011d9c10
+   7.19% postgres  postgres            [.] s_lock
+   3.52% postgres  postgres            [.] GetSnapshotData
+   3.02% postgres  postgres            [.] calc_bucket
+   2.71% postgres  postgres            [.] hash_search_with_hash_value
    2.32% postgres  [unknown]           [H] 0x011e0d7c
+   2.17% postgres  postgres            [.] pg_atomic_fetch_add_u32_impl
+   1.84% postgres  postgres            [.] AllocSetAlloc
+   1.57% postgres  postgres            [.] _bt_compare
+   1.05% postgres  postgres            [.] AllocSetFreeIndex
+   1.02% postgres  [kernel.kallsyms]   [k] .__copy_tofrom_user_power7
+   0.94% postgres  postgres            [.] tas
+   0.85%  swapper  [kernel.kallsyms]   [k] .int_sqrt
+   0.80% postgres  postgres            [.] pg_encoding_mbcliplen
+   0.78%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
    0.65%  pgbench  [unknown]           [H] 0x011d96e0
+   0.59% postgres  postgres            [.] hash_any
+   0.54% postgres  postgres            [.] LWLockRelease

Detailed Profile Data
-
-   7.19% postgres  postgres            [.] s_lock
   - s_lock
      - 3.79% s_lock
         - 1.69% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              0
         - 1.63% hash_search_with_hash_value
            - 1.63% BufTableDelete
                 BufferAlloc
                 ReadBuffer_common
                 ReadBufferExtended
      - 1.45% hash_search_with_hash_value
         - 1.45% hash_search_with_hash_value
              BufTableDelete
              BufferAlloc
              ReadBuffer_common
      - 1.43% get_hash_entry
         - 1.43% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              ReadBuffer_common
-   3.52% postgres  postgres            [.] GetSnapshotData
   - GetSnapshotData
      - 3.50% GetSnapshotData
         - 3.49% GetTransactionSnapshot
            - 1.75% exec_bind_message
                 PostgresMain
                 0
            - 1.74% PortalStart
                 exec_bind_message
                 PostgresMain
                 0


To reduce the above contention, I tried to write a patch that replaces the
spinlock used in dynahash to manage the free list with atomic operations.
Work is still pending on this patch to ensure that the approach it uses is
completely sane; however, I am posting the patch so that others can have a
look at it and give me feedback about the approach.

Re: [HACKERS] Scaling shared buffer eviction

2014-10-14 Thread Andres Freund
On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
 After that I observed that contention for LW_SHARED has reduced
 for this load, but it didn't help much in terms of performance, so I again
 rechecked the profile.  This time most of the contention has moved
 to the spinlock used in dynahash for the buffer mapping tables; please
 refer to the profile (for 128 client count; read-only load) below:
 
 bgreclaimer patch + wait free lw_shared acquisition patches -
 --
 
     9.24%  swapper  [unknown]           [H] 0x011d9c10
 +   7.19% postgres  postgres            [.] s_lock
 +   3.52% postgres  postgres            [.] GetSnapshotData
 +   3.02% postgres  postgres            [.] calc_bucket
 +   2.71% postgres  postgres            [.] hash_search_with_hash_value
     2.32% postgres  [unknown]           [H] 0x011e0d7c
 +   2.17% postgres  postgres            [.] pg_atomic_fetch_add_u32_impl
 +   1.84% postgres  postgres            [.] AllocSetAlloc
 +   1.57% postgres  postgres            [.] _bt_compare
 +   1.05% postgres  postgres            [.] AllocSetFreeIndex
 +   1.02% postgres  [kernel.kallsyms]   [k] .__copy_tofrom_user_power7
 +   0.94% postgres  postgres            [.] tas
 +   0.85%  swapper  [kernel.kallsyms]   [k] .int_sqrt
 +   0.80% postgres  postgres            [.] pg_encoding_mbcliplen
 +   0.78%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
     0.65%  pgbench  [unknown]           [H] 0x011d96e0
 +   0.59% postgres  postgres            [.] hash_any
 +   0.54% postgres  postgres            [.] LWLockRelease

This profile is without -O2 again. I really don't think it makes much
sense to draw much inference from an unoptimized build. I realize that
you said that the builds you use for benchmarking don't have that
problem, but that doesn't make this profile meaningful...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-10 Thread Amit Kapila
On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
 
  I don't think OLTP really is the best test case for this. Especially not
  pgbench with relatively small rows *and* a uniform distribution of
  access.
 
  Try parallel COPY TO. Batch write loads is where I've seen this hurt
  badly.


 just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
 hope that with an approach like your's this might become better.

 psql -f /tmp/prepare.sql
 pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Thanks for providing the scripts.  You haven't specified how much data
is present in the 'large' file used in the tests.  I have tried with
different sets of rows, but I could not see the dip that is present in your
data when you increased shared buffers from 1GB to 8GB, and I also don't see
any difference with the patch.  BTW, why do you think this patch can be
helpful for such workloads?  According to my understanding it can be helpful
mainly for read-mostly workloads when all the data doesn't fit in shared
buffers.

Performance Data
---
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

For 50 rows

Data was populated using the statement below:
insert into largedata_64 values ('aa',generate_series(1,50));
copy largedata_64 to '/tmp/large' binary;

pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

shared_buffers - 1GB
-
progress: 7.0 s, 2.7 tps, lat 2326.645 ms stddev 173.506
progress: 11.5 s, 3.5 tps, lat 2295.577 ms stddev 78.949
progress: 15.8 s, 3.7 tps, lat 2298.217 ms stddev 223.346
progress: 20.4 s, 3.5 tps, lat 2350.187 ms stddev 192.312
progress: 25.1 s, 3.4 tps, lat 2280.206 ms stddev 54.580
progress: 31.9 s, 3.4 tps, lat 2408.593 ms stddev 243.230
progress: 45.2 s, 1.1 tps, lat 5120.151 ms stddev 3913.561
progress: 50.5 s, 1.3 tps, lat 8967.954 ms stddev 3384.229
progress: 52.7 s, 2.7 tps, lat 3883.788 ms stddev 1733.293
progress: 55.6 s, 3.2 tps, lat 2684.282 ms stddev 348.615
progress: 58.2 s, 3.4 tps, lat 2602.355 ms stddev 268.718
progress: 60.8 s, 3.1 tps, lat 2361.937 ms stddev 302.643
progress: 65.3 s, 3.5 tps, lat 2341.903 ms stddev 162.338
progress: 74.1 s, 2.6 tps, lat 2720.182 ms stddev 716.425
progress: 76.4 s, 3.5 tps, lat 3023.234 ms stddev 670.473
progress: 80.4 s, 2.0 tps, lat 2795.323 ms stddev 820.429
progress: 85.6 s, 1.9 tps, lat 4756.217 ms stddev 844.284
progress: 91.9 s, 2.2 tps, lat 3996.001 ms stddev 1301.143
progress: 96.6 s, 3.5 tps, lat 2284.419 ms stddev 85.013
progress: 101.1 s, 3.5 tps, lat 2282.848 ms stddev 71.388
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 275
latency average: 2939.784 ms
latency stddev: 1739.974 ms
tps = 2.710138 (including connections establishing)
tps = 2.710208 (excluding connections establishing)



shared_buffers - 8GB

progress: 6.7 s, 2.7 tps, lat 2349.816 ms stddev 212.889
progress: 11.0 s, 3.5 tps, lat 2257.364 ms stddev 141.148
progress: 15.2 s, 3.8 tps, lat 2209.669 ms stddev 127.101
progress: 21.7 s, 3.7 tps, lat 2159.838 ms stddev 92.205
progress: 25.8 s, 3.9 tps, lat 2221.072 ms stddev 283.362
progress: 30.1 s, 3.5 tps, lat 2179.611 ms stddev 152.741
progress: 39.3 s, 2.1 tps, lat 2768.609 ms stddev 1265.508
progress: 50.9 s, 1.1 tps, lat 9361.388 ms stddev 2657.885
progress: 52.9 s, 1.0 tps, lat 2036.098 ms stddev 3.599
progress: 55.2 s, 4.3 tps, lat 2167.688 ms stddev 91.183
progress: 57.6 s, 3.0 tps, lat 2399.219 ms stddev 173.535
progress: 60.2 s, 4.1 tps, lat 2427.273 ms stddev 198.698
progress: 65.2 s, 3.4 tps, lat 2441.630 ms stddev 123.773
progress: 72.4 s, 2.9 tps, lat 2534.631 ms stddev 254.162
progress: 75.0 s, 3.9 tps, lat 2468.266 ms stddev 221.969
progress: 82.3 s, 3.0 tps, lat 2548.690 ms stddev 404.852
progress: 86.7 s, 1.4 tps, lat 3980.576 ms stddev 1205.743
progress: 92.5 s, 1.4 tps, lat 5174.340 ms stddev 643.415
progress: 97.1 s, 3.7 tps, lat 3252.847 ms stddev 1689.268
progress: 101.8 s, 3.4 tps, lat 2346.690 ms stddev 138.251
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 284
latency average: 2856.195 ms
latency stddev: 1740.699 ms
tps = 2.781603 (including connections establishing)
tps = 2.781682 (excluding connections establishing)


For 5000 rows

shared_buffers - 1GB
---
progress: 5.0 s, 357.7 tps, lat 22.295 ms stddev 3.511
progress: 10.0 s, 339.0 tps, lat 23.606 ms stddev 4.388
progress: 15.0 s, 323.4 tps, lat 24.733 ms stddev 5.001
progress: 20.0 s, 329.6 tps, lat 24.258 ms stddev 4.407
progress: 25.0 s, 334.3 tps, lat 23.963 ms stddev 4.126
progress: 30.0 s, 337.5 tps, lat 23.699 ms 

Re: [HACKERS] Scaling shared buffer eviction

2014-10-10 Thread Andres Freund
On 2014-10-10 12:28:13 +0530, Amit Kapila wrote:
 On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
  
   I don't think OLTP really is the best test case for this. Especially not
   pgbench with relatively small rows *and* a uniform distribution of
   access.
  
   Try parallel COPY TO. Batch write loads is where I've seen this hurt
   badly.
 
 
  just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
  hope that with an approach like your's this might become better.
 
  psql -f /tmp/prepare.sql
  pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100
 
 Thanks for providing the scripts.  You haven't specified how much data
 is present in 'large' file used in tests.

I don't think it matters much. It should be small enough that you get a
couple TPS over all backends.

 I have tried with different set of
 rows, but I could not see the dip that is present in your data when you
 increased shared buffers from 1GB to 8GB, also I don't see any difference
 with patch.

Interesting. I wonder whether that's because the concurrency wasn't high
enough for that machine to show the problem. I ran the test on my
workstation which has 8 actual cores.

 BTW, why do you think this patch can be helpful for such workloads?
 According to my understanding it can be helpful mainly for read-mostly
 workloads when all the data doesn't fit in shared buffers.

The performance dip comes from all the backends performing the clock
sweep. As the access is pretty uniform, all the buffers start with some
usage count (IIRC 3 in this example; much worse if 5). Due to the
uniform usage count the clock sweep frequently has to go several times
through *all* the buffers. That leads to quite horrible performance in
some cases.
I had hoped that bgreclaimer could make that workload less horrible by
funneling most of the accesses through the freelist.
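
To make that concrete, here is a heavily simplified, illustrative sketch (not
the actual StrategyGetBuffer() code) of the clock sweep and of what a
bgreclaimer-style process would do.  With a uniform usage count of 3, the
loop below has to decrement every buffer through three complete passes before
any victim turns up, and without a background reclaimer every backend pays
that cost itself:

typedef struct BufSketch
{
    int         buf_id;
    int         usage_count;    /* bumped on each access, capped (at 5 in PostgreSQL) */
    int         refcount;       /* pins held by backends */
} BufSketch;

/*
 * Simplified clock sweep: with uniform usage counts every buffer must be
 * decremented usage_count times before any buffer becomes a victim, so the
 * sweep makes several full passes over shared_buffers.
 */
static int
clocksweep_victim(BufSketch *bufs, int nbuffers, int *clock_hand)
{
    for (;;)
    {
        BufSketch  *buf = &bufs[*clock_hand];

        *clock_hand = (*clock_hand + 1) % nbuffers;

        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf->buf_id;         /* found a victim */
        if (buf->usage_count > 0)
            buf->usage_count--;         /* just age it and keep sweeping */
    }
}

/*
 * What a bgreclaimer-style background process would do: run the same sweep
 * out of line and keep the freelist topped up, so foreground backends can
 * normally just pop a buffer instead of sweeping themselves.
 */
static void
bgreclaimer_fill(BufSketch *bufs, int nbuffers, int *clock_hand,
                 void (*push_freelist) (int buf_id),
                 int freelist_len, int low_water)
{
    while (freelist_len < low_water)
    {
        push_freelist(clocksweep_victim(bufs, nbuffers, clock_hand));
        freelist_len++;
    }
}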

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-09 Thread Andres Freund
On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
 On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
 
  On another point, I think it would be a good idea to rebase the
  bgreclaimer patch over what I committed, so that we have a
  clean patch against master to test with.
 
 Please find the rebased patch attached with this mail.  I have taken
 some performance data as well and done some analysis based on
 the same.
 
 Performance Data
 
 IBM POWER-8 24 cores, 192 hardware threads
 RAM = 492GB
 max_connections =300
 Database Locale =C
 checkpoint_segments=256
 checkpoint_timeout=15min
 shared_buffers=8GB
 scale factor = 5000
 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
 Duration of each individual run = 5mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatively small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

 patch_ver/client_count      1       8      32      64     128     256
 HEAD                    18884  118628  251093  216294  186625  177505
 PATCH                   18743  122578  247243  205521  179712  175031

So, pretty much no benefits on any scale, right?


 Here we can see that the performance dips at higher client
 count (>=32) which was quite surprising for me, as I was expecting
 it to improve, because bgreclaimer reduces the contention by making
 buffers available on free list.  So I tried to analyze the situation by
 using perf and found that in above configuration, there is a contention
 around freelist spinlock with HEAD and the same is removed by Patch,
 but still the performance goes down with Patch.  On further analysis, I
 observed that actually after Patch there is an increase in contention
 around ProcArrayLock (shared LWlock) via GetSnapshotData which
 sounds a bit odd, but that's what I can see in profiles.  Based on analysis,
 few ideas which I would like to further investigate are:
 a.  As there is an increase in spinlock contention, I would like to check
 with Andres's latest patch which reduces contention around shared
 lwlocks.
 b.  Reduce some instructions added by patch in StrategyGetBuffer(),
 like instead of awakening bgreclaimer at low threshold, awaken when
 it tries to do clock sweep.
 

Are you sure you didn't mix up the profiles here? The head vs. patched
look more like profiles from different client counts than different
versions of the code.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-09 Thread Amit Kapila
On Thu, Oct 9, 2014 at 7:31 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
  On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com
wrote:
  
   On another point, I think it would be a good idea to rebase the
   bgreclaimer patch over what I committed, so that we have a
   clean patch against master to test with.
 
  Please find the rebased patch attached with this mail.  I have taken
  some performance data as well and done some analysis based on
  the same.
 
  Performance Data
  
  IBM POWER-8 24 cores, 192 hardware threads
  RAM = 492GB
  max_connections =300
  Database Locale =C
  checkpoint_segments=256
  checkpoint_timeout=15min
  shared_buffers=8GB
  scale factor = 5000
  Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
  Duration of each individual run = 5mins

 I don't think OLTP really is the best test case for this. Especially not
  pgbench with relatively small rows *and* a uniform distribution of
 access.

 Try parallel COPY TO. Batch write loads is where I've seen this hurt
 badly.

  patch_ver/client_count      1       8      32      64     128     256
  HEAD                    18884  118628  251093  216294  186625  177505
  PATCH                   18743  122578  247243  205521  179712  175031

 So, pretty much no benefits on any scale, right?

Almost right; there seems to be a slight benefit at client count
8, however that can be due to variation as well.

  Here we can see that the performance dips at higher client
  count (>=32) which was quite surprising for me, as I was expecting
  it to improve, because bgreclaimer reduces the contention by making
  buffers available on free list.  So I tried to analyze the situation by
  using perf and found that in above configuration, there is a contention
  around freelist spinlock with HEAD and the same is removed by Patch,
  but still the performance goes down with Patch.  On further analysis, I
  observed that actually after Patch there is an increase in contention
  around ProcArrayLock (shared LWlock) via GetSnapshotData which
  sounds a bit odd, but that's what I can see in profiles.  Based on
analysis,
  few ideas which I would like to further investigate are:
  a.  As there is an increase in spinlock contention, I would like to
check
  with Andres's latest patch which reduces contention around shared
  lwlocks.
  b.  Reduce some instructions added by patch in StrategyGetBuffer(),
  like instead of awakening bgreclaimer at low threshold, awaken when
  it tries to do clock sweep.
 

 Are you sure you didn't mix up the profiles here?

I have tried this twice; basically I am quite confident from my side,
but human error can't be ruled out.  I have used the below steps:

Steps used for profiling
during configure, use CFLAGS=-fno-omit-frame-pointer
Terminal-1
Start Server
Terminal-2
./pgbench -c 64 -j 64 -T 300 -S -M prepared postgres
Terminal-3
perf record -a -g sleep 60    -- this command is run after a minute or so
                              -- of test start

After test is finished - perf report -g graph,0.5,callee

Do you see any problem in the way I am collecting perf reports?

In any case, I can try once more if you still doubt the profiles.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-10-09 Thread Andres Freund
On 2014-10-09 16:01:55 +0200, Andres Freund wrote:
 On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:
  On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
  
   On another point, I think it would be a good idea to rebase the
   bgreclaimer patch over what I committed, so that we have a
   clean patch against master to test with.
  
  Please find the rebased patch attached with this mail.  I have taken
  some performance data as well and done some analysis based on
  the same.
  
  Performance Data
  
  IBM POWER-8 24 cores, 192 hardware threads
  RAM = 492GB
  max_connections =300
  Database Locale =C
  checkpoint_segments=256
  checkpoint_timeout=15min
  shared_buffers=8GB
  scale factor = 5000
  Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
  Duration of each individual run = 5mins
 
 I don't think OLTP really is the best test case for this. Especially not
  pgbench with relatively small rows *and* a uniform distribution of
 access.
 
 Try parallel COPY TO. Batch write loads is where I've seen this hurt
 badly.

As an example, the attached scripts go from:
progress: 5.3 s, 20.9 tps, lat 368.917 ms stddev 49.655
progress: 10.1 s, 21.0 tps, lat 380.326 ms stddev 64.525
progress: 15.1 s, 14.1 tps, lat 568.108 ms stddev 226.040
progress: 20.4 s, 12.0 tps, lat 634.557 ms stddev 300.519
progress: 25.2 s, 17.5 tps, lat 461.738 ms stddev 136.257
progress: 30.2 s, 9.8 tps, lat 850.766 ms stddev 305.454
progress: 35.3 s, 12.2 tps, lat 670.473 ms stddev 271.075
progress: 40.2 s, 7.9 tps, lat 972.617 ms stddev 313.152
progress: 45.3 s, 14.9 tps, lat 546.056 ms stddev 211.987
progress: 50.2 s, 13.2 tps, lat 610.608 ms stddev 271.780
progress: 55.5 s, 16.9 tps, lat 468.757 ms stddev 156.516
progress: 60.5 s, 14.3 tps, lat 548.913 ms stddev 190.414
progress: 65.7 s, 9.3 tps, lat 821.293 ms stddev 353.665
progress: 70.1 s, 16.0 tps, lat 524.240 ms stddev 174.903
progress: 75.2 s, 17.0 tps, lat 485.692 ms stddev 194.273
progress: 80.2 s, 19.9 tps, lat 396.295 ms stddev 78.891
progress: 85.3 s, 18.3 tps, lat 423.744 ms stddev 105.798
progress: 90.1 s, 14.5 tps, lat 577.373 ms stddev 270.914
progress: 95.3 s, 12.0 tps, lat 649.434 ms stddev 247.001
progress: 100.3 s, 14.6 tps, lat 563.693 ms stddev 275.236
tps = 14.81 (including connections establishing)

to:
progress: 5.1 s, 18.9 tps, lat 409.766 ms stddev 75.032
progress: 10.3 s, 20.2 tps, lat 396.781 ms stddev 67.593
progress: 15.1 s, 19.1 tps, lat 418.545 ms stddev 109.431
progress: 20.3 s, 20.6 tps, lat 388.606 ms stddev 74.259
progress: 25.1 s, 19.5 tps, lat 406.591 ms stddev 109.050
progress: 30.0 s, 19.1 tps, lat 420.199 ms stddev 157.005
progress: 35.0 s, 18.4 tps, lat 421.102 ms stddev 124.019
progress: 40.3 s, 12.3 tps, lat 640.640 ms stddev 88.409
progress: 45.2 s, 12.8 tps, lat 586.471 ms stddev 145.543
progress: 50.5 s, 6.9 tps, lat 1116.603 ms stddev 285.479
progress: 56.2 s, 6.3 tps, lat 1349.055 ms stddev 381.095
progress: 60.6 s, 7.9 tps, lat 1083.745 ms stddev 452.386
progress: 65.0 s, 9.6 tps, lat 805.981 ms stddev 273.845
progress: 71.1 s, 9.6 tps, lat 798.273 ms stddev 184.108
progress: 75.2 s, 9.3 tps, lat 950.131 ms stddev 150.870
progress: 80.8 s, 8.6 tps, lat 899.389 ms stddev 135.090
progress: 85.3 s, 8.8 tps, lat 928.183 ms stddev 152.056
progress: 90.9 s, 8.0 tps, lat 929.737 ms stddev 71.155
progress: 95.7 s, 9.0 tps, lat 968.070 ms stddev 127.824
progress: 100.3 s, 8.7 tps, lat 911.767 ms stddev 130.697

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
hope that with an approach like yours this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
CREATE OR REPLACE FUNCTION exec(text) returns text language plpgsql volatile
  AS $f$
BEGIN
  EXECUTE $1;
  RETURN $1;
END;
$f$;
\o /dev/null
SELECT exec('drop table if exists largedata_'||g.i||'; create unlogged table largedata_'||g.i||'(data bytea, id serial primary key);') FROM generate_series(0, 64) g(i);
\o
COPY largedata_:client_id(data) FROM '/tmp/large' BINARY;



Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-09-25 10:42:29 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
  On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   That leads me to wonder: Have you measured different, lower, numbers of
   buffer mapping locks? 128 locks is, if we'd, as we should, align them
   properly, 8KB of memory. Common L1 cache sizes are around 32k...
 
  Amit has some results upthread showing 64 being good, but not as good
  as 128.  I haven't verified that myself, but have no reason to doubt
  it.
 
  How about you push the spinlock change and I crosscheck the partition
  number on a multi socket x86 machine? Seems worthwile to make sure that
  it doesn't cause problems on x86. I seriously doubt it'll, but ...
 
 OK.

Given that the results look good, do you plan to push this?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
 OK.

 Given that the results look good, do you plan to push this?

By this, you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

If so, and if you don't have any reservations, yeah I'll go change it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:
  OK.
 
  Given that the results look good, do you plan to push this?
 
 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define 
MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
or something like that?
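
Spelled out, the two variants under discussion would look roughly like this
(illustrative only; the 100 figure matches the current value mentioned
elsewhere in this thread, and NUM_BUFFER_PARTITIONS comes from lwlock.h):

/* 9.4-era lwlock.c hard-codes the headroom: */
#define MAX_SIMUL_LWLOCKS  100

/* The suggestion above would tie it to the partition count instead, so that
 * raising NUM_BUFFER_PARTITIONS cannot silently exceed the limit:
 *
 *     #define MAX_SIMUL_LWLOCKS  (NUM_BUFFER_PARTITIONS + 64)
 */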

 If so, and if you don't have any reservations, yeah I'll go change it.

Yes, I'm happy with going forward.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  OK.
 
  Given that the results look good, do you plan to push this?

 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

 Yes. Now that I think about it I wonder if we shouldn't define 
 MAX_SIMUL_LWLOCKS like
 #define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
 or something like that?

Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
thing, and I don't see any reason to assume that, even if we're making
it true for now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 10:56:05 -0400, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-10-02 10:40:30 -0400, Robert Haas wrote:
  On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
  wrote:
   OK.
  
   Given that the results look good, do you plan to push this?
 
  By this, you mean the increase in the number of buffer mapping
  partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
 
  Yes. Now that I think about it I wonder if we shouldn't define 
  MAX_SIMUL_LWLOCKS like
  #define MAX_SIMUL_LWLOCKS   (NUM_BUFFER_PARTITIONS + 64)
  or something like that?
 
 Nah.  That assumes NUM_BUFFER_PARTITIONS will always be the biggest
 thing, and I don't see any reason to assume that, even if we're making
 it true for now.

The reason I'm suggesting it is that NUM_BUFFER_PARTITIONS (and
NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to
be held at the same time. It doesn't seem friendly to require users
experimenting with changing this to know about a define that's private
to lwlock.c.
But I'm fine with doing this not at all or separately - there's no need
to actually do it together.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Heikki Linnakangas

On 10/02/2014 05:40 PM, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com wrote:

OK.


Given that the results look good, do you plan to push this?


By this, you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?


Hmm, do we actually ever need to hold all the buffer partition locks at 
the same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, 
I couldn't find any place where we'd do that. I bumped up 
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did 
make check. It passed.


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Andres Freund
On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote:
 On 10/02/2014 05:40 PM, Robert Haas wrote:
 On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
 OK.
 
 Given that the results look good, do you plan to push this?
 
 By this, you mean the increase in the number of buffer mapping
 partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?
 
 Hmm, do we actually ever need to hold all the buffer partition locks at the
 same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I
 couldn't find any place where we'd do that. I bumped up
 NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
 make check. It passed.

Do a make check-world and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

I'd actually quite like to have a pg_buffercache version that, at least
optionally, doesn't do this, but that's a separate thing.
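
The reason check-world trips over it: pg_buffercache takes every buffer
mapping partition lock before scanning the buffer headers, so it holds
NUM_BUFFER_PARTITIONS lwlocks at once.  Roughly, as a paraphrase rather than
the exact contrib code (the partition-lock lookup name below is illustrative):

static void
scan_all_buffers_with_partition_locks(void)
{
    int         i;

    /*
     * Take every buffer mapping partition lock so the scan sees a consistent
     * picture; holding all of them simultaneously is what forces
     * MAX_SIMUL_LWLOCKS to be at least NUM_BUFFER_PARTITIONS.
     * BufMappingPartitionLockByIndex() is used here only as a stand-in name
     * for "the i-th partition lock".
     */
    for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
        LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);

    /* ... walk the buffer descriptors and copy out what pg_buffercache reports ... */

    /* Release in the opposite order. */
    for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
        LWLockRelease(BufMappingPartitionLockByIndex(i));
}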

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-10-02 Thread Robert Haas
On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund and...@2ndquadrant.com wrote:
 Do a make check-world and it'll hopefully fail ;). Check
 pg_buffercache_pages.c.

Yep.  Committed, with an update to the comments in lwlock.c to allude
to the pg_buffercache issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-10-01 Thread Andres Freund
On 2014-10-01 20:54:39 +0200, Andres Freund wrote:
 Here we go.
 
 Postgres was configured with.
  -c shared_buffers=8GB \
  -c log_line_prefix=[%m %p]  \
  -c log_min_messages=debug1 \
  -p 5440 \
  -c checkpoint_segments=600
  -c max_connections=200

Robert reminded me that I missed reporting the hardware aspect of
this...

4x E5-4620  @ 2.20GHz: 32 cores, 64 threads
256GB RAM

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-27 Thread Heikki Linnakangas
Part of this patch was already committed, and the overall patch has had 
its fair share of review for this commitfest, so I'm marking this as 
Returned with feedback. The benchmark results for the bgreclaimer 
showed a fairly small improvement, so it doesn't seem like anyone's 
going to commit the rest of the patch soon, not without more discussion 
and testing anyway. Let's not hold up the commitfest for that.


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Amit Kapila
 On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com
wrote:

 But this gets at another point: the way we're benchmarking this right
 now, we're really conflating the effects of three different things:

 1. Changing the locking regimen around the freelist and clocksweep.
 2. Adding a bgreclaimer process.
 3. Raising the number of buffer locking partitions.

First of all, thanks for committing part-1 of these changes, and it
seems you are planning to commit part-3 based on the results of tests
which Andres is planning to do; for the remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                     16      32      64     128
reduce-replacement-locking.patch + 128 Buf Partitions  157732  229547  271536  245295
scalable_buffer_eviction_v9.patch                      163762  230753  275147  248309


Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count                                     16      32      64     128
reduce-replacement-locking.patch + 128 Buf Partitions  157781  212134  202209  171176
scalable_buffer_eviction_v9.patch                      160301  213922  208680  172720


The results indicate that in all cases there is a benefit from doing
part-2 (bgreclaimer).  Though the benefit at this configuration is
not high, there might be more benefit at some higher configurations
(scale factor - 1).  Do you see any merit in pursuing further to
accomplish part-2 as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Heikki Linnakangas

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.


Hmm. Is it the comparing, or calculating the hash? We could precalculate
the hash for RelFileNode+ForkNumber and store it in RelationData. At a
lookup, you'd only need to mix in the block number.
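
A minimal sketch of that idea (illustrative only; the struct, field, and
mixing function below are hypothetical, not existing PostgreSQL APIs): hash
the RelFileNode plus fork once, cache it with the relation, and at buffer
lookup time only combine it with the block number:

#include <stdint.h>

typedef struct CachedRelHash
{
    uint32_t    rel_fork_hash;  /* hash of RelFileNode + ForkNumber, computed
                                 * once and cached (e.g. in RelationData) */
} CachedRelHash;

/* Illustrative mixing step (boost-style hash_combine), not necessarily the
 * mixer PostgreSQL would actually use. */
static inline uint32_t
hash_combine32(uint32_t seed, uint32_t value)
{
    seed ^= value + 0x9e3779b9u + (seed << 6) + (seed >> 2);
    return seed;
}

/* At lookup time only the block number still has to be mixed in. */
static inline uint32_t
buffer_lookup_hash(const CachedRelHash *rel, uint32_t blockNum)
{
    return hash_combine32(rel->rel_fork_hash, blockNum);
}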


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:
 On 09/25/2014 05:40 PM, Andres Freund wrote:
 There's two reasons for that: a) dynahash just isn't very good and it
 does a lot of things that will never be necessary for these hashes. b)
  the key into the hash table is *far* too wide. A significant portion of
 the time is spent comparing buffer/lock tags.
 
 Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.
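
For reference, the 20 bytes being compared are the buffer tag, which in this
era looks roughly like this (paraphrased from buf_internals.h, with stand-in
typedefs so the sketch is self-contained):

#include <stdint.h>

typedef uint32_t Oid;           /* stand-ins for the PostgreSQL typedefs */
typedef uint32_t BlockNumber;
typedef int ForkNumber;

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;                  /* 12 bytes */

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;        /* 4 bytes */
    BlockNumber blockNum;       /* 4 bytes, block number within the fork */
} BufferTag;                    /* 20 bytes in total -- the key being memcmp'd */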

Besides making the hashtable more efficient, a smaller key (say, 4 byte
relfilenode, 4 byte blocknumber) would also make using a radix tree or
similar more realistic. I've prototyped that once and it has nice
properties, but the tree is too deep.  Obviously it'd also help to make
buffer descriptors smaller, which is also good from a cache efficiency
perspective...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 First of all, thanks for committing part-1 of these changes, and it
 seems you are planning to commit part-3 based on the results of tests
 which Andres is planning to do; for the remaining part (part-2), today
 I have tried some tests, the results of which are as follows:

 Scale Factor - 3000, Shared_buffer - 8GB

 Patch_Ver/Client_Count                                     16      32      64     128
 reduce-replacement-locking.patch + 128 Buf Partitions  157732  229547  271536  245295
 scalable_buffer_eviction_v9.patch                      163762  230753  275147  248309


 Scale Factor - 3000, Shared_buffer - 8GB

 Patch_Ver/Client_Count                                     16      32      64     128
 reduce-replacement-locking.patch + 128 Buf Partitions  157781  212134  202209  171176
 scalable_buffer_eviction_v9.patch                      160301  213922  208680  172720


 The results indicate that in all cases there is a benefit from doing
 part-2 (bgreclaimer).  Though the benefit at this configuration is
 not high, but might be at some higher configurations
 (scale factor - 1) there is more benefit.  Do you see any merit
 in pursuing further to accomplish part-2 as well?


Interesting results.  Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal.  The biggest jump you've got seems to be
at scale factor 3000 with 64 clients, where you picked up about 4%.  4%
isn't nothing, but it's not a lot, either.  On the other hand, this might
not be the best we can do.  There may be further improvements to
bgreclaimer that make the benefit larger.

Backing up a bit, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see?  You didn't say which machine this testing was done on,
or how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 cores because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good.  Still, we'd expect performance to go
up, or at least not go down.  Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down, and between 64 and 128 we go down more.  You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores.  That
implies a remaining contention bottleneck.

My results from yesterday were a bit different.  I tested 1 client, 8
clients, and multiples of 16 clients out to 96.  With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients.  It definitely
wasn't linear, but it went up, not down.  I don't know why this is
different from what you are seeing.  Anyway there's a little more ambiguity
there about how much contention remains, but my bet is that there is at
least some contention that we could still hope to remove.  We need to
understand where that contention is.  Are the buffer mapping locks still
contended?  Is the new spinlock contended?  Are there other contention
points?  I won't be surprised if it turns out that the contention is on the
new spinlock and that a proper design for bgreclaimer is the best way to
remove that contention -- but I also won't be surprised if it turns out
that there are bigger wins elsewhere.  So I think you should try to figure
out where the remaining contention is first, and then we can discuss what
to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer
patch over what I committed, so that we have a clean patch against master
to test with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Heikki Linnakangas

On 09/26/2014 03:26 PM, Andres Freund wrote:

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.


Hmm. Is it the comparing, or calculating the hash?


Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.


Hmm. We could provide a custom compare function instead of relying on
memcmp. We can do somewhat better than generic memcmp when we know that
the BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and
it's always exactly 20 bytes. I wonder if you're actually just seeing a
cache miss showing up in the profile, though.
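
One way such a comparator could look if only 4-byte alignment is assumed (an
illustrative sketch, not a tested patch; Ants posts an 8-byte-word variant
elsewhere in this thread):

#include <stdint.h>

/* Stand-in for the real BufferTag: exactly 20 bytes, at least 4-byte aligned. */
typedef struct SketchBufferTag
{
    uint32_t    words[5];
} SketchBufferTag;

/*
 * Fixed-size, word-wise equality test: five 4-byte compares instead of a
 * generic variable-length memcmp.  Returns nonzero when the tags match.
 */
static inline int
buffer_tag_equal(const SketchBufferTag *a, const SketchBufferTag *b)
{
    /* Checking words[4] (the block number) first, as suggested upthread,
     * gives the earliest exit for tags in the same relation. */
    return a->words[4] == b->words[4] &&
           a->words[0] == b->words[0] &&
           a->words[1] == b->words[1] &&
           a->words[2] == b->words[2] &&
           a->words[3] == b->words[3];
}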


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Robert Haas
On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote:
 Neither, really. The hash calculation is visible in the profile, but not
 that pronounced yet. The primary thing noticeable in profiles (besides
 cache efficiency) is the comparison of the full tag after locating a
 possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the
rest only if that matches?

One idea for improving this is to get rid of relation forks.  Like all
true PostgreSQL patriots, I love the free space map and the visibility
map passionately, and I believe that those features are one of the
biggest contributors to the success of the project over the last
half-decade.  But I'd love them even more if they didn't triple our
rate of inode consumption and bloat our buffer tags.  More, it's just
not an extensible mechanism: too many things have to loop over all
forks, and it just doesn't scale to keep adding more of them.  If we
added a metapage to each heap, we could have the FSM and VM have their
own relfilenode and just have the heap point at them.  Or (maybe
better still) we could store the data in the heap itself.

It would be a lot of work, though.  :-(
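
Purely to make the shape of the idea concrete, a hypothetical heap metapage
could carry something like the following (nothing of the sort exists in
PostgreSQL; all names here are invented for illustration):

#include <stdint.h>

/* Hypothetical on-disk metapage letting the FSM and VM live in their own
 * relfilenodes instead of as extra forks of the heap's relfilenode. */
typedef struct HeapMetaPageData
{
    uint32_t    hmp_magic;              /* identifies/validates the metapage */
    uint32_t    hmp_version;
    uint32_t    hmp_fsm_relfilenode;    /* 0 if the heap has no FSM yet */
    uint32_t    hmp_vm_relfilenode;     /* 0 if the heap has no VM yet */
} HeapMetaPageData;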

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Ants Aasma
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 Neither, really. The hash calculation is visible in the profile, but not
 that pronounced yet. The primary thing noticeable in profiles (besides
 cache efficiency) is the comparison of the full tag after locating a
 possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum &&
+			((uint64*)a)[0] == ((uint64*)b)[0] &&
+			((uint64*)a)[1] == ((uint64*)b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  *		size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,



Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Amit Kapila
On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:

 First of all, thanks for committing part-1 of these changes, and it
 seems you are planning to commit part-3 based on the results of tests
 which Andres is planning to do; for the remaining part (part-2), today
 I have tried some tests, the results of which are as follows:

 Scale Factor - 3000, Shared_buffer - 8GB

 Patch_Ver/Client_Count                                     16      32      64     128
 reduce-replacement-locking.patch + 128 Buf Partitions  157732  229547  271536  245295
 scalable_buffer_eviction_v9.patch                      163762  230753  275147  248309


 Scale Factor - 3000, Shared_buffer - 8GB


Typo here: Scale Factor - 3000, Shared_buffer - *2*GB



 Patch_Ver/Client_Count                                     16      32      64     128
 reduce-replacement-locking.patch + 128 Buf Partitions  157781  212134  202209  171176
 scalable_buffer_eviction_v9.patch                      160301  213922  208680  172720


  The results indicate that in all cases there is a benefit from doing
 part-2 (bgreclaimer).  Though the benefit at this configuration is
 not high, but might be at some higher configurations
 (scale factor - 1) there is more benefit.  Do you see any merit
 in pursuing further to accomplish part-2 as well?


 Interesting results.  Thanks for gathering this data.


One more point I have missed is that the above data was collected using
the -M prepared option of pgbench.



 If this is the best we can do with the bgreclaimer, I think the case for
 pursuing it is somewhat marginal.  The biggest jump you've got seems to be
 at scale factor 3000 with 64 clients, where you picked up about 4%.  4%
 isn't nothing, but it's not a lot, either.  On the other hand, this might
 not be the best we can do.  There may be further improvements to
 bgreclaimer that make the benefit larger.

 Backing up a bit, to what extent have we actually solved the problem here?
 If we had perfectly removed all of the scalability bottlenecks, what would
 we expect to see?  You didn't say which machine this testing was done on


It was IBM POWER7
Sorry, I should have mentioned it.


 , or how many cores it had, but for example on the IBM POWER7 machine, we
 probably wouldn't expect the throughput at 64 clients to be 4 times the
 throughput at 16 cores because up to 16 clients each one can have a full
 CPU core, whereas after that and out to 64 each is getting a hardware
 thread, which is not quite as good.  Still, we'd expect performance to go
 up, or at least not go down.  Your data shows a characteristic performance
 knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
 down,


Here another point worth noting is that it goes down between
32 and 64 when shared_buffers is 2GB; however, when
shared_buffers is 8GB it doesn't go down between 32 and 64.


 and between 64 and 128 we go down more.  You haven't got enough data
 points there to show very precisely where the knee is, but unless you
 tested this on a smaller box than what you have been using, we're certainly
 hitting the knee sometime before we run out of physical cores.  That
 implies a remaining contention bottleneck.

 My results from yesterday were a bit different.  I tested 1 client, 8
 clients, and multiples of 16 clients out to 96.  With
 reduce-replacement-locking.patch + 128 buffer mapping partitions,
 performance continued to rise all the way out to 96 clients.  It definitely
 wasn't linearly, but it went up, not down.  I don't know why this is
 different from what you are seeing.


I think it is almost the same if we consider the same configuration
(scale_factor - 3000, shared_buffer - 8GB).


 Anyway there's a little more ambiguity there about how much contention
 remains, but my bet is that there is at least some contention that we could
 still hope to remove.  We need to understand where that contention is.  Are
 the buffer mapping locks still contended?  Is the new spinlock contended?
 Are there other contention points?  I won't be surprised if it turns out
 that the contention is on the new spinlock and that a proper design for
 bgreclaimer is the best way to remove that contention  but I also won't
 be surprised if it turns out that there are bigger wins elsewhere.  So I
 think you should try to figure out where the remaining contention is first,
 and then we can discuss what to do about it.


Make sense.


 On another point, I think it would be a good idea to rebase the
 bgreclaimer patch over what I committed, so that we have a clean patch
 against master to test with.


I think this also makes sense, however I think it is better to first
see where the bottleneck is.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 17:01:52 +0300, Ants Aasma wrote:
 On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  Neither, really. The hash calculation is visible in the profile, but not
  that pronounced yet. The primary thing noticeable in profiles (besides
  cache efficiency) is the comparison of the full tag after locating a
  possible match in a bucket. 20 byte memcmp's aren't free.
 
 I'm not arguing against a more efficient hash table, but one simple
 win would be to have a buffer tag specific comparison function. I mean
 something like the attached patch. This way we avoid the loop counter
 overhead, can check the blocknum first and possibly have better branch
 prediction.

Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing
most benefits - at least with my own hashtable implementation.

 Do you have a workload where I could test if this helps alleviate the
 comparison overhead?

Fully cached readonly pgbench workloads with -M prepared already show
it. But it gets more pronounced for workloads that access buffers at a
higher frequency.

At two customers I've seen this really badly in the profile because they
have OLTP statements that do some index nested loops, often looking the
same buffer up *over and over*. There often isn't a better plan (with
current pg join support at least) for joining a somewhat small number of
rows out of a large table to a small/mid sized table.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Andres Freund
On 2014-09-26 09:59:41 -0400, Robert Haas wrote:
 On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund and...@2ndquadrant.com wrote:
  Neither, really. The hash calculation is visible in the profile, but not
  that pronounced yet. The primary thing noticeable in profiles (besides
  cache efficiency) is the comparison of the full tag after locating a
  possible match in a bucket. 20 byte memcmp's aren't free.
 
 Would it be faster to test the relfilenode first, and then test the
 rest only if that matches?

I tried that and I couldn't see that much benefit. Check my message to
Heikki.

 One idea for improving this is to get rid of relation forks.

I think that's the hardest end to start from. A cool goal, but
hard. Getting rid of the tablespace sounds comparatively simple. And even
getting rid of the database in the buffer tag seems to be simpler
although already pretty hard.

 Like all
 true PostgreSQL patriots, I love the free space map and the visibility
 map passionately, and I believe that those features are one of the
 biggest contributors to the success of the project over the last
 half-decade.  But I'd love them even more if they didn't triple our
 rate of inode consumption and bloat our buffer tags.  More, it's just
 not an extensible mechanism: too many things have to loop over all
 forks, and it just doesn't scale to keep adding more of them.  If we
 added a metapage to each heap, we could have the FSM and VM have their
 own relfilenode and just have the heap point at them.  Or (maybe
 better still) we could store the data in the heap itself.
 
 It would be a lot of work, though.  :-(

Yea, it's really hard. And nearly impossible to do without breaking
binary compatibility.

What I've been wondering about was to give individual forks their own
relfilenodes and manage them via columns in pg_class. But that's also a
heck of a lot of work and gets complicated for unlogged relations
because we need to access those during recovery when we don't have
catalog access yet and can't access all databases anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote:
 The patch I attached the first time was just the last commit in the
 git repository where I wrote the patch, rather than the changes that I
 made on top of that commit.  So, yes, the results from the previous
 message are with the patch attached to the follow-up.  I just typed
 the wrong git command when attempting to extract that patch to attach
 it to the email.

Here are some more results.  TL;DR: The patch still looks good, but we
should raise the number of buffer mapping partitions as well.

On the IBM POWER7 machine, I ran read-only pgbench tests with 1
client, 8 clients, and all multiples of 16 up to 96.  I ran these
tests at scale factor 1000 and scale factor 3000. I tested four
builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204,
that same commit with the number of buffer mapping partitions raised
to 128, that commit with reduce-replacement-locking.patch applied, and
that commit with reduce-replacement-locking.patch applied AND the
number of buffer mapping partitions raised to 128.  The results from
each configuration are reported in that order on each of the lines
below; each is the median of three results.  shared_buffers=8GB for
all tests.

scale factor 1000
1 8119.907618 8230.853237 8153.515217 8145.045004
8 65457.006762 65826.439701 65851.010116 65703.168020
16 125263.858855 125723.441853 125020.598728 129506.037997
32 176696.288187 182376.232631 178278.917581 186440.340283
48 193251.602743 214243.417591 197958.562641 226782.327868
64 182264.276909 218655.105894 190364.759052 256863.652885
80 171719.210488 203104.673512 179861.241080 274065.020956
96 162525.883898 190960.622943 169759.271356 277820.128782

scale factor 3000
1 7690.357053 7723.925932 7772.207513 7684.079850
8 60789.325087 61688.547446 61485.398967 62546.166411
16 112509.423777 115138.385501 115606.858594 120015.350112
32 147881.211900 161359.994902 153302.501020 173063.463752
48 129748.929652 153986.160920 136164.387103 204935.207578
64 114364.542340 132174.970721 116705.371890 224636.957891
80 101375.265389 117279.931095 102374.794412 232966.076908
96 93144.724830 106676.309224 92787.650325 233862.872939

Analysis:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth.  At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients.  The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation.  At scale factor 3000, the situation
is more complicated.  With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master.  But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

2. To see the effect of increasing the number of buffer mapping locks
to 128, compare the first TPS number in each line to the second, or
the third to the fourth.  Without reduce-replacement-locking.patch,
that's a win at every concurrency level at both scale factors.  With
that patch, the 1 and 8 client tests are small losses at scale factor
1000, and the 1 client test is a small loss at scale factor 3000.

The single-client results, which are often a concern for scalability
patches, bear a bit of further comment.  In this round of testing,
either patch alone improved things slightly, and both patches together
made them slightly worse.  Even if that is reproducible, I don't think
it should be cause for concern, because it tends to indicate (at least
to me) that the shifting around is just the result of slightly
different placement of code across cache lines, or other minor factors
we can't really pin down.  So I'm inclined to (a) push
reduce-replacement-locking.patch and then also (b) bump up the number
of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
accordingly so that pg_buffercache doesn't get unhappy).
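
Concretely, (b) is just a pair of constant bumps.  As a sketch only (not
the committed diff, and assuming the pre-patch values of 16 partitions and
MAX_SIMUL_LWLOCKS = 100):

/* src/include/storage/lwlock.h */
#define NUM_BUFFER_PARTITIONS  128		/* was 16 */

/* src/backend/storage/lmgr/lwlock.c -- pg_buffercache takes every buffer
 * mapping partition lock at once, so this must stay above 128 */
#define MAX_SIMUL_LWLOCKS  200			/* was 100 */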

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Merlin Moncure
On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 1. To see the effect of reduce-replacement-locking.patch, compare the
 first TPS number in each line to the third, or the second to the
 fourth.  At scale factor 1000, the patch wins in all of the cases with
 32 or more clients and exactly half of the cases with 1, 8, or 16
 clients.  The variations at low client counts are quite small, and the
 patch isn't expected to do much at low concurrency levels, so that's
 probably just random variation.  At scale factor 3000, the situation
 is more complicated.  With only 16 bufmappinglocks, the patch gets its
 biggest win at 48 clients, and by 96 clients it's actually losing to
 unpatched master.  But with 128 bufmappinglocks, it wins - often
 massively - on everything but the single-client test, which is a small
 loss, hopefully within experimental variation.

 Comments?

Why stop at 128 mapping locks?   Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured?  I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared buffers...
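
(Purely to illustrate that idea, and not something proposed or measured in
this thread: a rule of roughly one partition per 8192 shared buffers,
clamped to a power of two, would land on 128 for the 8GB cases above.)

/* Illustrative sketch only; nothing like this exists in the tree. */
static int
ChooseBufferPartitions(int nbuffers)
{
	int		nparts = 16;

	while (nparts < 1024 && nbuffers / nparts > 8192)
		nparts *= 2;
	return nparts;
}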

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 1. To see the effect of reduce-replacement-locking.patch, compare the
 first TPS number in each line to the third, or the second to the
 fourth.  At scale factor 1000, the patch wins in all of the cases with
 32 or more clients and exactly half of the cases with 1, 8, or 16
 clients.  The variations at low client counts are quite small, and the
 patch isn't expected to do much at low concurrency levels, so that's
 probably just random variation.  At scale factor 3000, the situation
 is more complicated.  With only 16 bufmappinglocks, the patch gets its
 biggest win at 48 clients, and by 96 clients it's actually losing to
 unpatched master.  But with 128 bufmappinglocks, it wins - often
 massively - on everything but the single-client test, which is a small
 loss, hopefully within experimental variation.

 Comments?

 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
 dependent on some other value, perhaps the # of shared buffers...

Good question.  My belief is that the number of buffer mapping locks
required to avoid serious contention will be roughly proportional to
the number of hardware threads.  At the time the value 16 was chosen,
there were probably not more than 8-core CPUs in common use; but now
we've got a machine with 64 hardware threads and, what do you know but
it wants 128 locks.

I think the long-term solution here is that we need a lock-free hash
table implementation for our buffer mapping tables, because I'm pretty
sure that just cranking the number of locks up and up is going to
start to have unpleasant side effects at some point.  We may be able
to buy a few more years by just cranking it up, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:51:17 -0400, Robert Haas wrote:
 On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote:
  The patch I attached the first time was just the last commit in the
  git repository where I wrote the patch, rather than the changes that I
  made on top of that commit.  So, yes, the results from the previous
  message are with the patch attached to the follow-up.  I just typed
  the wrong git command when attempting to extract that patch to attach
  it to the email.
 
 Here are some more results.  TL;DR: The patch still looks good, but we
 should raise the number of buffer mapping partitions as well.

  So I'm inclined to (a) push
 reduce-replacement-locking.patch and then also (b) bump up the number
 of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
 accordingly so that pg_buffercache doesn't get unhappy).

I'm happy with that. I don't think it's likely that a moderate increase
in the number of mapping lwlocks will be noticeably bad for any
workload.

One difference is that the total number of lwlock acquisitions will be a
bit higher because with more partitions it's more likely for the old and
new tag to fall into different partitions. But that's not really significant.

The other difference is the number of cachelines touched. Currently, in
concurrent workloads, there's already lots of L1 cache misses around the
buffer mapping locks because they're exclusively owned by a different
core/socket. So, to be effectively worse, the increase would need to
lead to lower overall cache hit rates by them not being in *any* cache
or displacing other content.

That leads me to wonder: Have you measured a different, lower, number of
buffer mapping locks? 128 locks is, if we align them properly as we should,
8KB of memory. Common L1 cache sizes are around 32k...
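
(That is, assuming one 64-byte cache line per lwlock: 128 * 64 bytes =
8192 bytes = 8KB, i.e. a quarter of a typical 32KB L1 data cache.)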

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote:
 On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
  1. To see the effect of reduce-replacement-locking.patch, compare the
  first TPS number in each line to the third, or the second to the
  fourth.  At scale factor 1000, the patch wins in all of the cases with
  32 or more clients and exactly half of the cases with 1, 8, or 16
  clients.  The variations at low client counts are quite small, and the
  patch isn't expected to do much at low concurrency levels, so that's
  probably just random variation.  At scale factor 3000, the situation
  is more complicated.  With only 16 bufmappinglocks, the patch gets its
  biggest win at 48 clients, and by 96 clients it's actually losing to
  unpatched master.  But with 128 bufmappinglocks, it wins - often
  massively - on everything but the single-client test, which is a small
  loss, hopefully within experimental variation.
 
  Comments?
 
 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
 dependent on some other value, perhaps the # of shared buffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 That leads me to wonder: Have you measured different, lower, number of
 buffer mapping locks? 128 locks is, if we'd as we should align them
 properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128.  I haven't verified that myself, but have no reason to doubt
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  That leads me to wonder: Have you measured different, lower, number of
  buffer mapping locks? 128 locks is, if we'd as we should align them
  properly, 8KB of memory. Common L1 cache sizes are around 32k...
 
 Amit has some results upthread showing 64 being good, but not as good
 as 128.  I haven't verified that myself, but have no reason to doubt
 it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwhile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 10:09:30 -0400, Robert Haas wrote:
 I think the long-term solution here is that we need a lock-free hash
 table implementation for our buffer mapping tables, because I'm pretty
 sure that just cranking the number of locks up and up is going to
 start to have unpleasant side effects at some point.  We may be able
 to buy a few more years by just cranking it up, though.

I think mid to long term we actually need something else than a
hashtable, one capable of efficiently looking for the existence of
'neighboring' buffers so we can intelligently prefetch far enough that
the read actually completes when we get there. Also I'm pretty sure that
we'll need a way to efficiently remove all buffers for a relfilenode
from shared buffers - linearly scanning for that isn't a good
solution. So I think we need a different data structure.

I've played a bit around with just replacing buf_table.c with a custom
handrolled hashtable because I've seen more than one production workload
where hash_search_with_hash_value() is both cpu and cache miss wise
top#1 of profiles. With most calls coming from the buffer mapping and
then from the lock manager.

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.
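
(For reference, the tag being hashed and compared, roughly as defined in
src/include/storage/buf_internals.h at the time:)

typedef struct buftag
{
	RelFileNode rnode;			/* physical relation: 3 x 4-byte Oid */
	ForkNumber	forkNum;		/* 4 bytes */
	BlockNumber blockNum;		/* 4 bytes; 20 bytes of key in total */
} BufferTag;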

The aforementioned replacement hash table was a good bit faster for
fully cached workloads - but at the time I wrote it I could still make it
crash in very high cache pressure workloads, so that should be taken
with a fair bit of salt.

I think we can comparatively easily get rid of the tablespace in buffer
tags. Getting rid of the database already would be a fair bit harder. I
haven't really managed to get an idea how to remove the fork number
without making the catalog much more complicated.  I don't think we can
go too long without at least some of these steps :(.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Merlin Moncure
On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
 dependent on some other value, perhaps the # of shared buffers...

 Wrong way round. You need to prove the upside of increasing it further,
 not the contrary. The primary downside is cache hit ratio and displacing
 other cache entries...

I can't do that because I don't have the hardware.  I wasn't
suggesting to just set it but to measure the effects of setting it.
But the benefits from going from 16 to 128 are pretty significant at
least on this hardware; I'm curious how much further it can be
pushed...what's wrong with trying it out?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  That leads me to wonder: Have you measured different, lower, number of
  buffer mapping locks? 128 locks is, if we'd as we should align them
  properly, 8KB of memory. Common L1 cache sizes are around 32k...

 Amit has some results upthread showing 64 being good, but not as good
 as 128.  I haven't verified that myself, but have no reason to doubt
 it.

 How about you push the spinlock change and I crosscheck the partition
 number on a multi socket x86 machine? Seems worthwile to make sure that
 it doesn't cause problems on x86. I seriously doubt it'll, but ...

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote:
 On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
  Why stop at 128 mapping locks?   Theoretical downsides to having more
  mapping locks have been mentioned a few times but has this ever been
  measured?  I'm starting to wonder if the # mapping locks should be
  dependent on some other value, perhaps the # of shared buffers...
 
  Wrong way round. You need to prove the upside of increasing it further,
  not the contrary. The primary downside is cache hit ratio and displacing
  other cache entries...
 
 I can't do that because I don't have the hardware.

One interesting part of this is making sure it doesn't regress
older/smaller machines. So at least that side you could check...

 what's wrong with trying it out?

If somebody is willing to do it: nothing. I'd just much rather do the,
by now proven, simple change before starting with more complex
solutions.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Fri, Sep 19, 2014 at 7:21 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 Specific numbers of both the configurations for which I have
 posted data in previous mail are as follows:

 Scale Factor - 800
 Shared_Buffers - 12286MB (Total db size is 12288MB)
 Client and Thread Count = 64
 buffers_touched_freelist - count of buffers that backends found touched
 after
 popping from freelist.
 buffers_backend_clocksweep - count of buffer allocations not satisfied
 from freelist

   buffers_alloc               1531023
   buffers_backend_clocksweep  0
   buffers_touched_freelist    0



I didn't believe these numbers, so I did some testing.  I used the same
configuration you mention here, scale factor = 800, shared_buffers = 12286
MB, and I also saw buffers_backend_clocksweep = 0.  I didn't see
buffers_touched_freelist showing up anywhere, so I don't know whether that
would have been zero or not.  Then I tried reducing the high watermark for
the freelist from 2000 buffers to 25 buffers, and
buffers_backend_clocksweep was *still* 0.  At that point I started to smell
a rat.  It turns out that, with this test configuration, there's no buffer
allocation going on at all.  Everything fits in shared_buffers, or it did
on my test.  I had to reduce shared_buffers down to 10491800kB before I got
any significant buffer eviction.

At that level, a 100-buffer high watermark wasn't sufficient to prevent the
freelist from occasionally going empty.  A 2000-buffer high water mark was
by and large sufficient, although I was able to see small numbers of
buffers being allocated via clocksweep right at the very beginning of the
test, I guess before the reclaimer really got cranking.  So the watermarks
seem to be broadly in the right ballpark, but I think the statistics
reporting needs improving.  We need an easy way to measure the amount of
work that bgreclaimer is actually doing.

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock
sweep (i.e. the number of buffers it had to examine in order to reclaim the
number counted by #1).
4. The number of buffers past which a backend has advanced the clocksweep
(i.e. the number of buffers it had to examine in order to allocate the
number of buffers counted by #3).
5. The number of buffers allocated from the freelist which the backend did
not use because they'd been touched (what you're calling
buffers_touched_freelist).

It's hard to come up with good names for all of these things that are
consistent with the somewhat wonky existing names.  Here's an attempt:

1. bgreclaim_freelist
2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
think we want to make it more parallel with buffers_alloc, which is the
number of buffers allocated, not buffers_backend, the number of buffers
*written* by a backend)
3. clocksweep_bgreclaim
4. clocksweep_backend
5. freelist_touched
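
As a purely illustrative sketch (the struct name, field types and grouping
are hypothetical, not part of any posted patch), the five counters could be
kept together like this:

typedef struct BufAllocStats
{
	uint64		bgreclaim_freelist;			/* 1: buffers bgreclaimer freelisted */
	uint64		buffers_alloc_clocksweep;	/* 2: backend clocksweep allocations */
	uint64		clocksweep_bgreclaim;		/* 3: buffers reclaimer swept past */
	uint64		clocksweep_backend;			/* 4: buffers backends swept past */
	uint64		freelist_touched;			/* 5: freelist buffers skipped as touched */
} BufAllocStats;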

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Andres Freund
Hi,

On 2014-09-23 10:31:24 -0400, Robert Haas wrote:
 I suggest we count these things:
 
 1. The number of buffers the reclaimer has put back on the free list.
 2. The number of times a backend has run the clocksweep.
 3. The number of buffers past which the reclaimer has advanced the clock
 sweep (i.e. the number of buffers it had to examine in order to reclaim the
 number counted by #1).

 4. The number of buffers past which a backend has advanced the clocksweep
 (i.e. the number of buffers it had to examine in order to allocate the
 number of buffers count by #3).

 5. The number of buffers allocated from the freelist which the backend did
 not use because they'd been touched (what you're calling
 buffers_touched_freelist).

Sounds good.

 It's hard to come up with good names for all of these things that are
 consistent with the somewhat wonky existing names.  Here's an attempt:
 
 1. bgreclaim_freelist

bgreclaim_alloc_clocksweep?

 2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
 think we want to make it more parallel with buffers_alloc, which is the
 number of buffers allocated, not buffers_backend, the number of buffers
 *written* by a backend)
 3. clocksweep_bgreclaim
 4. clocksweep_backend

I think bgreclaim/backend should always either be a prefix or a
postfix. But not one in some variables and some in another.

 5. freelist_touched

I wonder if we shouldn't move all this to a new view, instead of
stuffing it somewhere where it really doesn't belong. pg_stat_buffers or
something like it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas robertmh...@gmail.com wrote:
 [ review ]

Oh, by the way, I noticed that this patch breaks pg_buffercache.  If
we're going to have 128 lock partitions, we need to bump
MAX_SIMUL_LWLOCKS.

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

I think it's pretty clear that #1 and #2 are a good idea.  #3 is a
mixed bag, and it might account for the regressions you saw on some
test cases.  Increasing the number of buffer mapping locks means that
those locks take up more cache lines, which could slow things down in
cases where there's no reduction in contention.  It also means that
the chances of an allocated buffer ending up in the same buffer
mapping lock partition are 1/128 instead of 1/16, which means about
5.4 additional lwlock acquire/release cycles per 100 allocations.
That's not a ton, but it's not free either.
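
(The arithmetic behind that figure, for reference: an eviction needs a
second partition lock whenever the old and new tags hash to different
partitions, so the expected extra acquire/release cycles per 100
allocations are 100 * (1/16 - 1/128) = 700/128, about 5.47.)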

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote:
 But this gets at another point: the way we're benchmarking this right
 now, we're really conflating the effects of three different things:

 1. Changing the locking regimen around the freelist and clocksweep.
 2. Adding a bgreclaimer process.
 3. Raising the number of buffer locking partitions.

I did some more experimentation on this.  Attached is a patch that
JUST does #1, and, as previously suggested, it uses a single spinlock
instead of using two of them that are probably in the same cacheline.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:

 - LWLockAcquire
+ 68.41% ReadBuffer_common
+ 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context-switches are the individual
buffer locks.  But are we suffering spinlock contention instead as a
result?   Yes, but not much.  s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer.  In a similar profile of master, s_lock is at
0.31%.  So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.

TPS results are a little higher with the patch - these are alternating
5 minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections
establishing)
reduce-replacement-locking tps = 177797.842410 (including connections
establishing)
reduce-replacement-locking tps = 177894.822656 (including connections
establishing)

The picture is similar at 64 clients, but the benefit is a little more:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections
establishing)
reduce-replacement-locking tps = 188598.728062 (including connections
establishing)
reduce-replacement-locking tps = 187340.977277 (including connections
establishing)

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

reduce-replacement-locking-128 tps = 251001.812843 (including
connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including
connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including
connections establishing)

The performance also goes up if I do that on master, but the effect is
substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)

I think this shows pretty clearly that, even without the bgreclaimer,
there's a lot of merit in getting rid of BufFreelistLock and using a
spinlock held for the absolutely minimal number of instructions
instead.  There's already some benefit without doing anything about
the buffer mapping locks, and we'll get a lot more benefit once that
issue is addressed.  I think we need to do some serious study to see
whether bgreclaimer is even necessary, because it looks like this
change alone, which is much simpler, takes a huge bite out of the
scalability problem.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch.  To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win.  We can then deal with
the other issues on their own merits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit df4077cda2eae3eb4a5cf387da0c1e7616e73204
Author: Robert Haas rh...@postgresql.org
Date:   Mon Sep 22 16:42:14 2014 -0400

Remove volatile qualifiers from lwlock.c.

Now that spinlocks (hopefully!) act as compiler barriers, as of commit
0709b7ee72e4bc71ad07b7120acd117265ab51d0, this should be safe.  This
serves as a demonstration of the new coding style, and may be optimized
better on some machines as well.

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..66fb2e4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote:
 I did some more experimentation on this.  Attached is a patch that
 JUST does #1, and, ...

...and that was the wrong patch.  Thanks to Heikki for pointing that out.

Second try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..a4ebbcc 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,12 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
 exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement.  A spinlock is used here rather than a lightweight
+lock for efficiency; no other locks of any sort should be acquired while
+buffer_strategy_lock is held.  This is essential to allow buffer replacement
+to happen in multiple backends with reasonable concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -165,7 +163,7 @@ consider their pages unlikely to be needed soon; however, the current
 algorithm never does that.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the buffer_strategy_lock, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
 need to take system-wide locks during common operations.  It works like
@@ -178,25 +176,26 @@ buffer reference count, so it's nearly free.)
 
 The clock hand is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+buffer_strategy_lock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain buffer_strategy_lock.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer.  Release
+buffer_strategy_lock.  If the buffer is pinned or has a nonzero usage count,
+it cannot be used; ignore it and go back to step 1.  Otherwise, pin the buffer,
+and return it.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
-circularly advance nextVictimBuffer for next time.
+3. Otherwise, the buffer free list is empty.  Select the buffer pointed to by
+nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
+Release buffer_strategy_lock.
 
 4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
-examine the next buffer.
+be used.  Decrement its usage count (if nonzero), reacquire
+buffer_strategy_lock, and return to step 3 to examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+5. Pin the selected buffer, and return.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +258,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take buffer_strategy_lock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote:
 I did some more experimentation on this.  Attached is a patch that
 JUST does #1, and, ...

 ...and that was the wrong patch.  Thanks to Heikki for point that out.
 Second try.

But the results you gave in the previous message were correctly
attributed?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 5:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas robertmh...@gmail.com wrote:
 I did some more experimentation on this.  Attached is a patch that
 JUST does #1, and, ...

 ...and that was the wrong patch.  Thanks to Heikki for point that out.
 Second try.

 But the results you gave in the previous message were correctly
 attributed?

The patch I attached the first time was just the last commit in the
git repository where I wrote the patch, rather than the changes that I
made on top of that commit.  So, yes, the results from the previous
message are with the patch attached to the follow-up.  I just typed
the wrong git command when attempting to extract that patch to attach
it to the email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Gregory Smith

On 9/23/14, 10:31 AM, Robert Haas wrote:

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the 
clock sweep (i.e. the number of buffers it had to examine in order to 
reclaim the number counted by #1).
4. The number of buffers past which a backend has advanced the 
clocksweep (i.e. the number of buffers it had to examine in order to 
allocate the number of buffers count by #3).
5. The number of buffers allocated from the freelist which the backend 
did not use because they'd been touched (what you're calling 
buffers_touched_freelist).


All sound reasonable.  To avoid wasting time here, I think it's only 
worth doing all of these as DEBUG level messages for now.  Then only go 
through the overhead of exposing the ones that actually seem relevant.  
That's what I did for the 8.3 work, and most of that data at this level 
was barely relevant to anyone but me then or since. We don't want the 
system views to include so much irrelevant trivia that finding the 
important parts becomes overwhelming.


I'd like to see that level of instrumentation--just the debug level 
messages--used to quantify the benchmarks that people are running 
already, to prove they are testing what they think they are.  That would 
have caught the test error you already stumbled on for example.  Simple 
paranoia says there may be more issues like that hidden in here 
somewhere, and this set you've identified should find any more of them 
around.


If all that matches up so the numbers for the new counters seem sane, I 
think that's enough to commit the first basic improvement here.  Then 
make a second pass over exposing the useful internals that let everyone 
quantify either the improvement or things that may need to be tunable.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Andres Freund
On 2014-09-23 16:29:16 -0400, Robert Haas wrote:
 On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas robertmh...@gmail.com wrote:
  But this gets at another point: the way we're benchmarking this right
  now, we're really conflating the effects of three different things:
 
  1. Changing the locking regimen around the freelist and clocksweep.
  2. Adding a bgreclaimer process.
  3. Raising the number of buffer locking partitions.
 
 I did some more experimentation on this.  Attached is a patch that
 JUST does #1, and, as previously suggested, it uses a single spinlock
 instead of using two of them that are probably in the same cacheline.
 Without this patch, on a 32-client, read-only pgbench at scale factor
 1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
 LWLock-inspired preemption:
 
  - LWLockAcquire
 + 68.41% ReadBuffer_common
 + 31.59% StrategyGetBuffer
 
 With the patch, unsurprisingly, StrategyGetBuffer disappears and the
 only lwlocks that are causing context-switches are the individual
 buffer locks.  But are we suffering spinlock contention instead as a
 result?   Yes, but not much.  s_lock is at 0.41% in the corresponding
 profile, and only 6.83% of those calls are from the patched
 StrategyGetBuffer.  In a similar profile of master, s_lock is at
 0.31%.  So there's a bit of additional s_lock contention, but I think
 it's considerably less than the contention over the lwlock it's
 replacing, because the spinlock is only held for the minimal amount of
 time needed, whereas the lwlock could be held across taking and
 releasing many individual buffer locks.

Am I understanding you correctly that you also measured context switches
for spinlocks? If so, I don't think that's a valid comparison. LWLocks
explicitly yield the CPU as soon as there's any contention while
spinlocks will, well, spin. Sure they also go to sleep, but it'll take
longer.

It's also worthwhile to remember in such comparisons that lots of the
cost of spinlocks will be in the calling routines, not s_lock() - the
first TAS() is inlined into it. And that's the one that'll incur cache
misses and such...

Note that I'm explicitly *not* doubting the use of a spinlock
itself. Given the short acquisition times and the exclusive use of
exclusive acquisition, a spinlock makes more sense. The lwlock's spinlock
alone will have about as much contention.

I think it might be possible to construct some cases where the spinlock
performs worse than the lwlock. But I think those will be clearly in the
minority. And at least some of those will be fixed by bgwriter. As an
example consider a single backend COPY ... FROM of large files with a
relatively large s_b. That's a seriously bad workload for the current
victim buffer selection because frequently most of the buffer pool needs to be
searched for a buffer with usagecount 0. And this patch will double the
amount of atomic ops during that.

Let me try to quantify that.

 What's interesting is that I can't see in the perf output any real
 sign that the buffer mapping locks are slowing things down, but they
 clearly are, because when I take this patch and also boost
 NUM_BUFFER_PARTITIONS to 128, the performance goes up:

What did events did you profile?

 I welcome further testing and comments, but my current inclination is
 to go ahead and push the attached patch.  To my knowledge, nobody has
 at any point objected to this aspect of what's being proposed, and it
 looks safe to me and seems to be a clear win.  We can then deal with
 the other issues on their own merits.

I've taken a look at this, and all the stuff I saw that I disliked was
there before this patch. So +1 for going ahead.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 6:02 PM, Gregory Smith gregsmithpg...@gmail.com wrote:
 On 9/23/14, 10:31 AM, Robert Haas wrote:
 I suggest we count these things:

 1. The number of buffers the reclaimer has put back on the free list.
 2. The number of times a backend has run the clocksweep.
 3. The number of buffers past which the reclaimer has advanced the clock
 sweep (i.e. the number of buffers it had to examine in order to reclaim the
 number counted by #1).
 4. The number of buffers past which a backend has advanced the clocksweep
 (i.e. the number of buffers it had to examine in order to allocate the
 number of buffers count by #3).
 5. The number of buffers allocated from the freelist which the backend did
 not use because they'd been touched (what you're calling
 buffers_touched_freelist).

 All sound reasonable.  To avoid wasting time here, I think it's only worth
 doing all of these as DEBUG level messages for now.  Then only go through
 the overhead of exposing the ones that actually seem relevant.  That's what
 I did for the 8.3 work, and most of that data at this level was barely
 relevant to anyone but me then or since. We don't want the system views to
 include so much irrelevant trivia that finding the important parts becomes
 overwhelming.

I think we expose far too little information in our system views.
Just to take one example, we expose no useful information about lwlock
acquire or release, but a lot of real-world performance problems are
caused by lwlock contention.  There are of course difficulties in
exposing huge numbers of counters, but we're not talking about many
here, so I'd lean toward exposing them in the final patch if they seem
at all useful.

 I'd like to see that level of instrumentation--just the debug level
 messages--used to quantify the benchmarks that people are running already,
 to prove they are testing what they think they are.  That would have caught
 the test error you already stumbled on for example.  Simple paranoia says
 there may be more issues like that hidden in here somewhere, and this set
 you've identified should find any more of them around.

Right.

 If all that matches up so the numbers for the new counters seem sane, I
 think that's enough to commit the first basic improvement here.  Then make a
 second pass over exposing the useful internals that let everyone quantify
 either the improvement or things that may need to be tunable.

Well, I posted a patch a bit ago that I think is the first basic
improvement - and none of these counters are relevant to that.  It
doesn't add a new background process or anything; it just does pretty
much the same thing we do now with less-painful locking.  There are no
water marks to worry about, or tunable thresholds, or anything; and
because it's so much simpler, it's far easier to reason about than the
full patch, which is why I feel quite confident pressing on toward a
commit.

Once that is in, I think we should revisit the idea of a bgreclaimer
process, and see how much more that improves things - if at all - on
top of what that basic patch already does.  For that we'll need these
counters, and maybe others.  But let's make that phase two.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund and...@2ndquadrant.com wrote:
 Am I understanding you correctly that you also measured context switches
 for spinlocks? If so, I don't think that's a valid comparison. LWLocks
 explicitly yield the CPU as soon as there's any contention while
 spinlocks will, well, spin. Sure they also go to sleep, but it'll take
 longer.

No, I measured CPU consumption attributable to s_lock.

(I checked context-switches, too.)

 It's also worthwile to remember in such comparisons that lots of the
 cost of spinlocks will be in the calling routines, not s_lock() - the
 first TAS() is inlined into it. And that's the one that'll incur cache
 misses and such...

True.  I can check that - I did not.

 Note that I'm explicitly *not* doubting the use of a spinlock
 itself. Given the short acquisition times and the exclusive use of
 exclusive acquisition, a spinlock makes more sense. The lwlock's spinlock
 alone will have about as much contention.

Right.

 I think it might be possible to construct some cases where the spinlock
 performs worse than the lwlock. But I think those will be clearly in the
 minority. And at least some of those will be fixed by bgwriter. As an
 example consider a single backend COPY ... FROM of large files with a
 relatively large s_b. That's a seriously bad workload for the current
 victim buffer selection because frequently most of the needs to be
 searched for a buffer with usagecount 0. And this patch will double the
 amount of atomic ops during that.

It will actually be far worse than that, because we'll acquire and
release the spinlock for every buffer over which we advance the clock
sweep, instead of just once for the whole thing.  That's reason to
hope that a smart bgreclaimer process may be a good improvement on top
of this.  It can do things like advance the clock sweep hand 16
buffers at a time and then sweep them all after-the-fact, reducing
contention on the mutex by an order-of-magnitude, if that turns out to
be an important consideration.
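
(A rough sketch of that batching idea, purely for illustration; the field
and lock names here are assumptions, not code from the patch:)

    /*
     * Illustrative sketch: claim 16 clock-sweep slots under a single
     * spinlock acquisition, then examine the claimed buffers without
     * holding the lock.  Wraparound/completePasses bookkeeping is omitted.
     */
    SpinLockAcquire(&StrategyControl->victimbuf_lck);
    start = StrategyControl->nextVictimBuffer;
    StrategyControl->nextVictimBuffer = (start + 16) % NBuffers;
    SpinLockRelease(&StrategyControl->victimbuf_lck);

    for (i = 0; i < 16; i++)
    {
        BufferDesc *buf = &BufferDescriptors[(start + i) % NBuffers];

        /* check refcount/usage_count and, if the buffer looks evictable,
         * push it onto the freelist, just as the per-buffer sweep would */
    }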

But I think it's right to view that as something we need to test vs.
the baseline established by this patch.  What's clear today is that
workloads which stress buffer-eviction fall to pieces, because the
entire buffer-eviction process is essentially single-threaded.  One
process can't begin evicting a buffer until another has finished doing
so.  This lets multiple backends do that at the same time.  We may
find cases where that leads to an unpleasant amount of contention, but
since we have several ideas for how to mitigate that as needs be, I
think it's OK to go ahead.  The testing we're doing on the combined
patch is conflating the effects of the new locking regimen with however
the bgreclaimer affects things, and it's very clear to me now that we
need to make sure those are clearly separate.

 Let me try to quantify that.

Please do.

 What's interesting is that I can't see in the perf output any real
 sign that the buffer mapping locks are slowing things down, but they
 clearly are, because when I take this patch and also boost
 NUM_BUFFER_PARTITIONS to 128, the performance goes up:

 What events did you profile?

cs.

 I welcome further testing and comments, but my current inclination is
 to go ahead and push the attached patch.  To my knowledge, nobody has
 at any point objected to this aspect of what's being proposed, and it
 looks safe to me and seems to be a clear win.  We can then deal with
 the other issues on their own merits.

 I've taken a look at this, and all the stuff I saw that I disliked was
 there before this patch. So +1 for going ahead.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Andres Freund
On 2014-09-23 19:21:10 -0400, Robert Haas wrote:
 On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund and...@2ndquadrant.com wrote:
  I think it might be possible to construct some cases where the spinlock
  performs worse than the lwlock. But I think those will be clearly in the
  minority. And at least some of those will be fixed by bgwriter.

Err, this should read bgreclaimer, not bgwriter.

  As an
  example consider a single backend COPY ... FROM of large files with a
  relatively large s_b. That's a seriously bad workload for the current
  victim buffer selection because frequently most of the buffer pool needs to
  be searched for a buffer with usagecount 0. And this patch will double the
  amount of atomic ops during that.
 
 It will actually be far worse than that, because we'll acquire and
 release the spinlock for every buffer over which we advance the clock
 sweep, instead of just once for the whole thing.

I said double, because we already acquire the buffer header's spinlock
every tick.

 That's reason to hope that a smart bgreclaimer process may be a good
 improvement on top of this.

Right. That's what I was trying to say with bgreclaimer above.

 But I think it's right to view that as something we need to test vs.
 the baseline established by this patch.

Agreed. I think the possible downsides are at the very fringe of already
bad cases. That's why I agreed that you should go ahead.

  Let me try to quantify that.
 
 Please do.

I've managed to find a ~1.5% performance regression. But the setup was
plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
a fillfactor 10 table with the column set to PLAIN storage. With the
resulting table size chosen so it's considerably bigger than s_b, but
smaller than the dirty writeback limit of the kernel.

That's perfectly reasonable.

I can think of a couple other cases, but they're all similarly absurd.

  What's interesting is that I can't see in the perf output any real
  sign that the buffer mapping locks are slowing things down, but they
  clearly are, because when I take this patch and also boost
  NUM_BUFFER_PARTITIONS to 128, the performance goes up:
 
  What events did you profile?
 
 cs.

Ah. My guess is that most of the time will probably actually be spent in
the lwlock's spinlock, not the lwlock putting itself to sleep.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Robert Haas
On Tue, Sep 23, 2014 at 7:42 PM, Andres Freund and...@2ndquadrant.com wrote:
 It will actually be far worse than that, because we'll acquire and
 release the spinlock for every buffer over which we advance the clock
 sweep, instead of just once for the whole thing.

 I said double, because we already acquire the buffer header's spinlock
 every tick.

Oh, good point.

  Let me try to quantify that.

 Please do.

 I've managed to find a ~1.5% performance regression. But the setup was
 plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
 a fillfactor 10 table with the column set to PLAIN storage. With the
 resulting table size chosen so it's considerably bigger than s_b, but
 smaller than the dirty writeback limit of the kernel.

 That's perfectly reasonable.

 I can think of a couple other cases, but they're all similarly absurd.

Well, it's not insane to worry about such things, but if you can only
manage 1.5% on such an extreme case, I'm encouraged.  This is killing
us on OLTP workloads, and fixing that is a lot more important than a
couple percent on an extreme case.

 Ah. My guess is that most of the time will probably actually be spent in
 the lwlock's spinlock, not the lwlock putting itself to sleep.

Ah, OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-23 Thread Gregory Smith

On 9/23/14, 7:13 PM, Robert Haas wrote:
I think we expose far too little information in our system views. Just 
to take one example, we expose no useful information about lwlock 
acquire or release, but a lot of real-world performance problems are 
caused by lwlock contention.
I sent over a proposal for what I was calling Performance Events about a 
year ago.  The idea was to provide a place to save data about lock 
contention, weird checkpoint sync events, that sort of thing.  Replacing 
log parsing to get at log_lock_waits data was my top priority.  Once 
that's there, lwlocks was an obvious next target.  Presumably we just 
needed collection to be low enough overhead, and then we can go down to 
whatever shorter locks we want; lower the overhead, faster the event we 
can measure.


Sometimes the database will never be able to instrument some of its 
fastest events without blowing away the event itself.  We'll still have 
perf / dtrace / systemtap / etc. for those jobs.  But those are not the 
problems of the average Postgres DBA's typical day.


The data people need to solve this sort of thing in production can't 
always show up in counters.  You'll get evidence the problem is there, 
but you need more details to actually find the culprit.  Some info about 
the type of lock, tables and processes involved, maybe the query that's 
running, that sort of thing.  You can kind of half-ass the job if you 
make per-table counters for everything, but we really need more, both to 
serve our users and to compare well against what other databases provide 
for tools.  That's why I was trying to get the infrastructure to capture 
all that lock detail, without going through the existing logging system 
first.


Actually building Performance Events fell apart on the storage side:  
figuring out where to put it all without waiting for a log file to hit 
disk.  I wanted in-memory storage so clients don't wait for anything, 
then a potentially lossy persistence writer.  I thought I could get away 
with a fixed size buffer like pg_stat_statements uses.  That was 
optimistic.  Trying to do better got me lost in memory management land 
without making much progress.


I think the work you've now done on dynamic shared memory gives the 
right shape of infrastructure that I could pull this off now.  I even 
have funding to work on it again, and it's actually the #2 thing I'd 
like to take on as I get energy for new feature development.  (#1 is the 
simple but time consuming job of adding block write counters, the lack 
of which is just killing me on some fast-growing installs)


I have a lot of unread messages on this list to sort through right now.  
I know I saw someone try to revive the idea of saving new sorts of 
performance log data again recently; can't seem to find it again right 
now.  That didn't seem like it went any farther than thinking about the 
specifications though.  Last time, I jumped right over that and instead hit 
a wall on this one hard part of the implementation: low-overhead 
memory management for saving everything.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-22 Thread Amit Kapila
On Mon, Sep 22, 2014 at 10:43 AM, Gregory Smith gregsmithpg...@gmail.com
wrote:

 On 9/16/14, 8:18 AM, Amit Kapila wrote:

 I think the main reason for slight difference is that
 when the size of shared buffers is almost same as data size, the number
 of buffers it needs from clock sweep are very less, as an example in first
 case (when size of shared buffers is 12286MB), it actually needs at most
 256 additional buffers (2MB) via clock sweep, where as bgreclaimer
 will put 2000 (high water mark) additional buffers (0.5% of shared buffers
 is greater than 2000 ) in free list, so bgreclaimer does some extra work
 when it is not required

 This is exactly what I was warning about, as the sort of lesson learned
 from the last round of such tuning.  There are going to be spots where
 trying to tune the code to be aggressive on the hard cases will work
 great.  But you need to make that dynamic to some degree, such that the
 code doesn't waste a lot of time sweeping buffers when the demand for them
 is actually weak.  That will make all sorts of cases that look like this
 slower.


To verify whether the above can lead to any kind of regression, I have
checked the cases (workload 0.05 or 0.1 percent larger than shared
buffers) where we need only a few extra buffers and bgreclaimer might put
some additional buffers on the freelist.  It turns out that in those cases
also there is a win, especially at high concurrency; the results are posted
upthread
(
http://www.postgresql.org/message-id/CAA4eK1LFGcvzMdcD5NZx7B2gCbP1G7vWK7w32EZk=voolud...@mail.gmail.com).



 We should be able to tell these apart if there's enough instrumentation
 and solid logic inside of the program itself though.  The 8.3 era BGW coped
 with a lot of these issues using a particular style of moving average with
 fast reaction time, plus instrumenting the buffer allocation rate as
 accurately as it could. So before getting into high/low water mark
 questions, are you comfortable that there's a clear, accurate number that
 measures the activity level that's important here?



Very good question.  This was exactly the thing that was
missing in my initial versions (about 2 years back, when I tried to
solve this problem), but based on Robert's and Andres's feedback
I realized that we need an accurate number to measure the activity
level (in this case, the consumption of buffers from the freelist), so
I have introduced the logic to calculate it (it is stored in the new
variable numFreeListBuffers in the BufferStrategyControl structure).



 And have you considered ways it might be averaging over time or have a
 history that's analyzed?



The current logic of bgreclaimer is such that even if it does
some extra activity (extra is very much controlled) in one cycle,
it will not start another cycle unless backends consume all the
buffers that were made available by bgreclaimer in one cycle.
I think the algorithm designed for bgreclaimer automatically
averages out based on activity.  Do you see any cases where it
will not do so?


 The exact fast approach / slow decay weighted moving average approach of
 the 8.3 BGW, the thing that tried to smooth the erratic data set possible
 here, was a pretty critical part of getting itself auto-tuning to workload
 size.  It ended up being much more important than the work of setting the
 arbitrary watermark levels.


Agreed, but the logic with which bgwriter works is pretty different
and that's why it needs a different kind of logic to handle auto-tuning.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-21 Thread Gregory Smith

On 9/16/14, 8:18 AM, Amit Kapila wrote:

I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required
This is exactly what I was warning about, as the sort of lesson learned 
from the last round of such tuning.  There are going to be spots where 
trying to tune the code to be aggressive on the hard cases will work 
great.  But you need to make that dynamic to some degree, such that the 
code doesn't waste a lot of time sweeping buffers when the demand for 
them is actually weak.  That will make all sorts of cases that look like 
this slower.


We should be able to tell these apart if there's enough instrumentation 
and solid logic inside of the program itself though.  The 8.3 era BGW 
coped with a lot of these issues using a particular style of moving 
average with fast reaction time, plus instrumenting the buffer 
allocation rate as accurately as it could. So before getting into 
high/low water note questions, are you comfortable that there's a clear, 
accurate number that measures the activity level that's important here?  
And have you considered ways it might be averaging over time or have a 
history that's analyzed? The exact fast approach / slow decay weighted 
moving average approach of the 8.3 BGW, the thing that tried to smooth 
the erratic data set possible here, was a pretty critical part of 
getting itself auto-tuning to workload size.  It ended up being much 
more important than the work of setting the arbitrary watermark levels.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-19 Thread Amit Kapila
On Tue, Sep 16, 2014 at 10:21 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:

 In most cases performance with patch is slightly less as compare
 to HEAD and the difference is generally less than 1% and in a case
 or 2 close to 2%. I think the main reason for slight difference is that
 when the size of shared buffers is almost same as data size, the number
 of buffers it needs from clock sweep are very less, as an example in first
 case (when size of shared buffers is 12286MB), it actually needs at most
 256 additional buffers (2MB) via clock sweep, where as bgreclaimer
 will put 2000 (high water mark) additional buffers (0.5% of shared buffers
 is greater than 2000 ) in free list, so bgreclaimer does some extra work
 when it is not required and it also leads to condition you mentioned
 down (freelist will contain buffers that have already been touched since
 we added them).  Now for case 2 (12166MB), we need buffers more
 than 2000 additional buffers, but not too many, so it can also have
 similar effect.


 So there are two suboptimal things that can happen and they pull in
 opposite directions.  I think you should instrument the server how often
 each is happening.  #1 is that we can pop a buffer from the freelist and
 find that it's been touched.  That means we wasted the effort of putting it
 on the freelist in the first place.  #2 is that we can want to pop a buffer
 from the freelist and find it empty and thus be forced to run the clock
 sweep ourselves.   If we're having problem #1, we could improve things by
 reducing the water marks.  If we're having problem #2, we could improve
 things by increasing the water marks.  If we're having both problems, then
 I dunno.  But let's get some numbers on the frequency of these specific
 things, rather than just overall tps numbers.


Specific numbers for both of the configurations for which I have
posted data in the previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64
buffers_touched_freelist - count of buffers that backends found touched
after popping from freelist.
buffers_backend_clocksweep - count of buffer allocations not satisfied from
freelist
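
(For orientation, a minimal sketch of where counters like these could be
bumped in the allocation path; the field and helper names below are
illustrative, not necessarily what the attached patch uses:)

    /* Illustrative sketch of counter placement in StrategyGetBuffer() */
    buf = PopBufferFromFreelist();                  /* hypothetical helper */
    if (buf == NULL)
    {
        StrategyControl->numBackendClocksweep++;    /* freelist was empty */
        buf = RunClockSweep();                      /* hypothetical helper */
    }
    else if (buf->refcount != 0 || buf->usage_count != 0)
    {
        StrategyControl->numTouchedFreelist++;      /* entry already touched;
                                                       skip it and retry */
    }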

buffers_alloc               1531023
buffers_backend_clocksweep  0
buffers_touched_freelist    0


Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)
Client and Thread Count = 64

buffers_alloc               1531010
buffers_backend_clocksweep  0
buffers_touched_freelist    0

In both the above cases, I have taken data multiple times to ensure
correctness.  From the above data, it is evident that in both the above
configurations all the requests are satisfied from the initial freelist.
Basically, the amount of shared buffers configured
(12286MB = 1572608 buffers and 12166MB = 1557248 buffers) is
sufficient to contain the entire workload of the pgbench run.

So now the question is why we see a small variation (1%) in the data
when all the data fits in shared buffers.  The reason could be that we
have added a few extra instructions (due to the increase in the
StrategyControl structure size, an additional function call, and one or
two new assignments) in the buffer allocation path (these extra
instructions matter only until all the data pages are associated with
buffers; after that, control won't even reach StrategyGetBuffer()), or
it may just be run-to-run variation across the different binaries.

I went ahead and took data for cases where shared buffers are a tiny bit
(0.1% and 0.05%) smaller than the workload (based on the buffer
allocations done in the above cases).

Performance Data
---


Scale Factor - 800
Shared_Buffers - 11950MB

Client_Count/Patch_Ver      8       16       32       64      128
HEAD                    68424   132540   195496   279511   283280
sbe_v9                  68565   132709   194631   284351   289333

Scale Factor - 800
Shared_Buffers - 11955MB

Client_Count/Patch_Ver      8       16       32       64      128
HEAD                    68331   127752   196385   274387   281753
sbe_v9                  68922   131314   194452   284292   287221

The above data indicates that performance is better with the patch
in almost all cases, especially at high concurrency (64 and
128 clients).

The overall conclusion is that with the patch:
a. when the data fits in RAM but not completely in shared buffers,
performance/scalability is quite good even if shared buffers are just a
tiny bit smaller than all the data.
b. when shared buffers are sufficient to contain all the data, there is
a slight difference (1%) in performance.



 d. Lets not do anything as if user does such a configuration, he should
 be educated to configure shared buffers in a better way and or the
 performance hit doesn't seem to be justified to do any further
 work.


 At least worth entertaining.

 Based on further analysis, I think this is the way to go.

Attached are the patch for the new stat (buffers_touched_freelist), in
case you want to run the patch with it, and the detailed (individual-run)
performance data.


With Regards,

Re: [HACKERS] Scaling shared buffer eviction

2014-09-16 Thread Amit Kapila
On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com
 wrote:
   On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:

 I will post the data with the latest patch separately (where I will focus
 on new cases discussed between Robert and Andres).


Performance Data with the latest version of the patch.
All the data shown below is the median of 3 runs; for the
individual run data, refer to the attached document
(perf_read_scalability_data_v9.ods)

 Performance Data for Read-only test
 -
 Configuration and Db Details
 IBM POWER-7 16 cores, 64 hardware threads
 RAM = 64GB
 Database Locale =C
 checkpoint_segments=256
 checkpoint_timeout=15min
 shared_buffers=8GB
 scale factor = 3000
 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
 Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client_Count/Patch_Ver      8       16       32       64      128
HEAD                    58614   107370   140717   104357    65010
sbe_v9                  62943   119064   172246   220174   220904


Observations
-
1. It scales well, as with previous versions of the patch, but
performance seems slightly better in a few cases, maybe because
I have removed a statement (if check) or two in bgreclaimer
(those were done under the spinlock), or it could just be
run-to-run difference.

 (1) A read-only pgbench workload that is just a tiny bit larger than
 shared_buffers, say size of shared_buffers plus 0.01%.  Such workloads
 tend to stress buffer eviction heavily.

When the data is just a tiny bit larger than shared buffers, there is
actually no scalability problem even in HEAD, because I think most of
the requests will be satisfied from the existing buffer pool.
I have taken data for some loads where the database size is a bit
larger than shared buffers, and it is as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)


Client_Count/Patch_Ver      1       8       16       32       64      128
HEAD                     8406   68712       13   198481   290340   289828
sbe_v9                   8504   68546   131926   195789   289959   289021

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)

Client_Count/Patch_Ver      1       8       16       32       64      128
HEAD                     8428   68609   128092   196596   292066   293812
sbe_v9                   8386   68546   126926   197126   289959   287621

Observations
-
In most cases performance with the patch is slightly lower compared
to HEAD; the difference is generally less than 1%, and in a case
or two close to 2%.  I think the main reason for the slight difference is
that when the size of shared buffers is almost the same as the data size,
the number of buffers it needs from the clock sweep is very small; as an
example, in the first case (when the size of shared buffers is 12286MB),
it actually needs at most 256 additional buffers (2MB) via clock sweep,
whereas bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required, and it also leads to the condition you mentioned
below (the freelist will contain buffers that have already been touched since
we added them).  Now for case 2 (12166MB), we need more than 2000
additional buffers, but not too many, so it can also have a
similar effect.

I think we have the below options related to this observation:
a. Some further tuning in bgreclaimer, so that instead of putting
buffers up to the high water mark on the freelist, it puts just 1/4th or
1/2 of the high water mark and then checks whether the freelist still
contains less than or equal to the low water mark; if yes it continues,
and if not it can wait (or maybe some other way).
b. Instead of waking bgreclaimer when the number of buffers falls
below the low water mark, wake it when the number of times backends
run the clock sweep crosses a certain threshold.
c. Provide the low and high water marks as config knobs, so that in some
rare cases users can use them for tuning.
d. Let's not do anything: if a user does such a configuration, he should
be educated to configure shared buffers in a better way, and/or the
performance hit doesn't seem to be justified to do any further
work.

Now if we do either 'a' or 'b', then I think there is a chance
that the gain might not be the same for cases where users can
easily get a benefit from this patch, and there is a chance that
it degrades the performance in some other case.

 (2) A workload that maximizes the rate of concurrent buffer eviction
 relative to other tasks.  Read-only pgbench is not bad for this, but
 maybe somebody's got a better idea.

I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB)
addresses this case.

 As I sort of mentioned in what I was writing for the bufmgr README,
 there are, more or less, three ways this can fall down, at least that
 I can see: (1) if the high water mark is too high, then we'll start
 finding buffers in the freelist 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-16 Thread Robert Haas
On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 In most cases performance with patch is slightly less as compare
 to HEAD and the difference is generally less than 1% and in a case
 or 2 close to 2%. I think the main reason for slight difference is that
 when the size of shared buffers is almost same as data size, the number
 of buffers it needs from clock sweep are very less, as an example in first
 case (when size of shared buffers is 12286MB), it actually needs at most
 256 additional buffers (2MB) via clock sweep, where as bgreclaimer
 will put 2000 (high water mark) additional buffers (0.5% of shared buffers
 is greater than 2000 ) in free list, so bgreclaimer does some extra work
 when it is not required and it also leads to condition you mentioned
 down (freelist will contain buffers that have already been touched since
 we added them).  Now for case 2 (12166MB), we need buffers more
 than 2000 additional buffers, but not too many, so it can also have
 similar effect.


So there are two suboptimal things that can happen and they pull in
opposite directions.  I think you should instrument the server to see how often
each is happening.  #1 is that we can pop a buffer from the freelist and
find that it's been touched.  That means we wasted the effort of putting it
on the freelist in the first place.  #2 is that we can want to pop a buffer
from the freelist and find it empty and thus be forced to run the clock
sweep ourselves.   If we're having problem #1, we could improve things by
reducing the water marks.  If we're having problem #2, we could improve
things by increasing the water marks.  If we're having both problems, then
I dunno.  But let's get some numbers on the frequency of these specific
things, rather than just overall tps numbers.


 I think we have below options related to this observation
 a. Some further tuning in bgreclaimer, so that instead of putting
 the buffers up to high water mark in freelist, it puts just 1/4th or
 1/2 of high water mark and then check if the free list still contains
 lesser than equal to low water mark, if yes it continues and if not
 then it can wait (or may be some other way).


That sounds suspiciously like just reducing the high water mark.


 b. Instead of waking bgreclaimer when the number of buffers fall
 below low water mark, wake when the number of times backends
 does clock sweep crosses certain threshold


That doesn't sound helpful.


 c. Give low and high water mark as config knobs, so that in some
 rare cases users can use them to do tuning.


Yuck.


 d. Lets not do anything as if user does such a configuration, he should
 be educated to configure shared buffers in a better way and or the
 performance hit doesn't seem to be justified to do any further
 work.


At least worth entertaining.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Scaling shared buffer eviction

2014-09-14 Thread Amit Kapila
On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
 On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com
wrote:
  On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
   +++ b/src/backend/postmaster/bgreclaimer.c
 
  A fair number of comments in that file refer to bgwriter...

 Will fix.

Fixed.

   @@ -0,0 +1,302 @@
  
+/*-
   + *
   + * bgreclaimer.c
   + *
   + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.
 It
   + * attempts to keep regular backends from having to run clock sweep
(which
   + * they would only do when they don't find a usable shared buffer
from
   + * freelist to read in another page).
 
  That's not really accurate. Freelist pages are often also needed to
  write new pages, without reading anything in.

 Agreed, but the same is used in bgwriter file as well; so if we
 change here, we might want to change bgwriter file header as well.

I have changed just bgreclaimer for this comment.  If the same
is required for bgwriter, I can create a separate patch, as that
change is not related to this patch, so I thought it better
to keep it separate.

  I'd phrase it as which
  they only need to do if they don't find a victim buffer from the
  freelist

 victim buffer sounds more like a buffer which it will get from
 clock sweep, how about next candidate (same is used in function
 header of StrategyGetBuffer()).

Fixed.

In the best scenario all requests
   + * for shared buffers will be fulfilled from freelist as the
background
   + * reclaimer process always tries to maintain buffers on freelist.
 However,
   + * regular backends are still empowered to run clock sweep to find a
usable
   + * buffer if the bgreclaimer fails to maintain enough buffers on
freelist.
 
  empowered sounds strange to me. 'still can run the clock sweep'?

 No harm in changing it as you suggest; however, the same is used in the
 file header of bgwriter.c as well, so I think let's keep this usage as it
 is because there is no correctness issue here.

Not changed anything for this comment.



 
   +static void bgreclaim_quickdie(SIGNAL_ARGS);
   +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
   +static void ReqShutdownHandler(SIGNAL_ARGS);
   +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
 
  This looks inconsistent.

 I have kept it based on bgwriter, so I am not sure if it's good to change.
 However, if we want consistency in naming, I would like to keep
 something like:

 ReclaimShutdownHandler
 ReclaimQuickDieHandler
 ..
 ..

Changed function names to make them consistent.

   + /*
   +  * If an exception is encountered, processing resumes here.
   +  *
   +  * See notes in postgres.c about the design of this coding.
   +  */
   + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
   + {
   + /* Since not using PG_TRY, must reset error stack by
hand */
   + error_context_stack = NULL;
 ..
 
  No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
  good idea, regardless of it possibly being true today (which I'm not
  sure about yet).

 I will add LWLockReleaseAll() in exception handling as discussed
 elsewhere in thread.

Done.


   +
   + /* Now we can allow interrupts again */
   + RESUME_INTERRUPTS();
 
  Other processes sleep for a second here, I think that's a good
  idea. E.g. that bit:

 Agreed, will make change as per suggestion.

Done.

 
   + /*
   +  * Loop forever
   +  */
   + for (;;)
   + {
   + int rc;
   +
   +
   + /*
   +  * Backend will signal bgreclaimer when the number of
buffers in
   +  * freelist falls below than low water mark of freelist.
   +  */
   + rc = WaitLatch(&MyProc->procLatch,
   +WL_LATCH_SET |
WL_POSTMASTER_DEATH,
   +-1);
 
  That's probably not going to work well directly after a (re)start of
  bgreclaim (depending on how you handle the water mark, I'll see in a
  bit).

 Could you please be more specific here?

I wasn't sure if any change is required here, so kept the code
as it is.

 
   +Background Reclaimer's Processing
   +-
 ..
   +Two water mark indicators are used to maintain sufficient number of
buffers
   +on freelist.  Low water mark indicator is used by backends to wake
bgreclaimer
   +when the number of buffers in freelist falls below it.  High water
mark
   +indicator is used by bgreclaimer to move buffers to freelist.
 
  For me the description of the high water as stated here doesn't seem to
  explain anything.
 
  This section should have a description of how the reclaimer interacts
  with the bgwriter logic. Do we put dirty buffers on the freelist that
  are then cleaned by the bgwriter? Which buffers does the bgwriter 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-14 Thread Amit Kapila
On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith gregsmithpg...@gmail.com
wrote:

 This looks like it's squashed one of the very fundamental buffer
  scaling issues though; well done Amit.

Thanks.

 I'll go back to my notes and try to recreate the pathological cases
 that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
 spreading effort I did; get those running again and see how they
 do on this new approach.  I have a decent sized 24 core server
 that should be good enough for this job.  I'll see what I can do.

It will be really helpful if you can try out those cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-14 Thread Mark Kirkwood

On 14/09/14 19:00, Amit Kapila wrote:

On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith
gregsmithpg...@gmail.com mailto:gregsmithpg...@gmail.com wrote:

  This looks like it's squashed one of the very fundamental buffer
   scaling issues though; well done Amit.

Thanks.

  I'll go back to my notes and try to recreate the pathological cases
  that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
  spreading effort I did; get those running again and see how they
  do on this new approach.  I have a decent sized 24 core server
  that should be good enough for this job.  I'll see what I can do.

It will be really helpful if you can try out those cases.




And if you want 'em run on the 60 core beast, just let me know the 
details and I'll do some runs for you.


Cheers

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Amit Kapila
On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
  +++ b/src/backend/postmaster/bgreclaimer.c

 A fair number of comments in that file refer to bgwriter...

Will fix.

  @@ -0,0 +1,302 @@
 
+/*-
  + *
  + * bgreclaimer.c
  + *
  + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.
 It
  + * attempts to keep regular backends from having to run clock sweep
(which
  + * they would only do when they don't find a usable shared buffer from
  + * freelist to read in another page).

 That's not really accurate. Freelist pages are often also needed to
 write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we
change here, we might want to change bgwriter file header as well.

 I'd phrase it as which
 they only need to do if they don't find a victim buffer from the
 freelist

victim buffer sounds more like a buffer which it will get from
clock sweep, how about next candidate (same is used in function
header of StrategyGetBuffer()).

   In the best scenario all requests
  + * for shared buffers will be fulfilled from freelist as the background
  + * reclaimer process always tries to maintain buffers on freelist.
 However,
  + * regular backends are still empowered to run clock sweep to find a
usable
  + * buffer if the bgreclaimer fails to maintain enough buffers on
freelist.

 empowered sounds strange to me. 'still can run the clock sweep'?

No harm in changing it as you suggest; however, the same is used in the
file header of bgwriter.c as well, so I think let's keep this usage as it
is because there is no correctness issue here.

  + * The bgwriter is started by the postmaster as soon as the startup
subprocess
  + * finishes, or as soon as recovery begins if we are doing archive
recovery.

 Why only archive recovery? I guess (only read this far...) it's not just
 during InArchiveRecovery recovery but also StandbyMode?

It will be for both.

 But I don't see
 why we shouldn't use it during normal crash recovery. That's also often
 painfully slow and the reclaimer could help? Less, but still.

Yes, it might improve things a bit; however, the main benefit of this patch
is under heavy load, which means it mainly addresses contention, and in
crash recovery there will not be any contention because there will be no
backend processes.

Also, I think enabling bgreclaimer during crash recovery needs one new
signal, which is not a problem, but it will make more sense to enable it
later based on whether there is a benefit.


  +static void bgreclaim_quickdie(SIGNAL_ARGS);
  +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
  +static void ReqShutdownHandler(SIGNAL_ARGS);
  +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

 This looks inconsistent.

I have kept it based on bgwriter, so I am not sure if it's good to change.
However, if we want consistency in naming, I would like to keep
something like:

ReclaimShutdownHandler
ReclaimQuickDieHandler
..
..

  + /*
  +  * If an exception is encountered, processing resumes here.
  +  *
  +  * See notes in postgres.c about the design of this coding.
  +  */
  + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
  + {
  + /* Since not using PG_TRY, must reset error stack by hand
*/
  + error_context_stack = NULL;
..

 No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
 good idea, regardless of it possibly being true today (which I'm not
 sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed
elsewhere in thread.


  +
  + /* Now we can allow interrupts again */
  + RESUME_INTERRUPTS();

 Other processes sleep for a second here, I think that's a good
 idea. E.g. that bit:

Agreed, will make change as per suggestion.


  + /*
  +  * Loop forever
  +  */
  + for (;;)
  + {
  + int rc;
  +
  +
  + /*
  +  * Backend will signal bgreclaimer when the number of
buffers in
  +  * freelist falls below than low water mark of freelist.
  +  */
  + rc = WaitLatch(&MyProc->procLatch,
  +WL_LATCH_SET |
WL_POSTMASTER_DEATH,
  +-1);

 That's probably not going to work well directly after a (re)start of
 bgreclaim (depending on how you handle the water mark, I'll see in a
 bit).

Could you please be more specific here?


  +Background Reclaimer's Processing
  +-
..
  +Two water mark indicators are used to maintain sufficient number of
buffers
  +on freelist.  Low water mark indicator is used by backends to wake
bgreclaimer
  +when the number of buffers in freelist falls below it.  High water mark
  +indicator is used by bgreclaimer to move 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Ants Aasma
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote:
  Hm. Perhaps we should do a bufHdr->refcount != zero check without
  locking here? The atomic op will transfer the cacheline exclusively to
  the reclaimer's CPU. Even though it very shortly afterwards will be
  touched afterwards by the pinning backend.

 Meh.  I'm not in favor of adding more funny games with locking unless
 we can prove they're necessary for performance.

 Well, this in theory increases the number of processes touching buffer
 headers regularly. Currently, if you have one read IO intensive backend,
 there's pretty much only process touching the cachelines. This will make
 it two. I don't think it's unreasonable to try to reduce the cacheline
 pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the cpu of the pinning backend.
Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When the time comes to unpin the buffer, the
cacheline must again be promoted to exclusive state, introducing
coherency traffic. Not locking the buffer only saves transferring the
cacheline back to the pinning backend, which is not a huge amount of savings.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Andres Freund
On 2014-09-12 12:38:48 +0300, Ants Aasma wrote:
 On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund and...@2ndquadrant.com wrote:
   Hm. Perhaps we should do a bufHdr->refcount != zero check without
   locking here? The atomic op will transfer the cacheline exclusively to
   the reclaimer's CPU. Even though it very shortly afterwards will be
   touched afterwards by the pinning backend.
 
  Meh.  I'm not in favor of adding more funny games with locking unless
  we can prove they're necessary for performance.
 
  Well, this in theory increases the number of processes touching buffer
  headers regularly. Currently, if you have one read IO intensive backend,
  there's pretty much only process touching the cachelines. This will make
  it two. I don't think it's unreasonable to try to reduce the cacheline
  pingpong caused by that...
 
 I don't think it will help much. A pinned buffer is pretty likely to
 be in modified state in the cache of the cpu of the pinning backend.

Right. Unless you're on a MOESI platform. I'd really like to know why
that's not more widely used.

 Even taking a look at the refcount will trigger a writeback and
 demotion to shared state.  When time comes to unpin the buffer the
 cacheline must again be promoted to exclusive state introducing
 coherency traffic. Not locking the buffer only saves transfering the
 cacheline back to the pinning backend, not a huge amount of savings.

Yes. But: in many, if not most, cases the cacheline will be read a
couple of times before being modified via the spinlock.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Gregory Smith

On 9/11/14, 7:01 AM, Andres Freund wrote:
I'm not convinced that these changes can be made without also changing 
the bgwriter logic. Have you measured whether there are differences in 
how effective the bgwriter is? Not that it's very effective right now :)


The current background writer tuning went out of its way to do nothing 
when it wasn't clear there was something that always worked.  What 
happened with all of the really clever schemes was that they worked on 
some workloads, and just trashed others.  Most of the gain from the 8.3 
rewrite came from looking at well theorized ideas for how to handle 
things like pre-emptive LRU scanning for writes, and just throwing them 
out altogether in favor of ignoring the problem.  The magic numbers left 
in or added to the code  were tuned to do very little work, 
intentionally.  If anything, since then the pressure to do nothing has 
gone up in the default install, because now people are very concerned 
about extra wakeups using power.


To find bad cases before, I was running about 30 different test 
combinations by the end; Heikki was running another set in the EDB lab, 
and I believe there was a lab at NTT running their own set too. What went in 
was the combination that didn't break any of them badly--not the one 
that performed best on the good workloads.


This looks like it's squashed one of the very fundamental buffer scaling 
issues though; well done Amit.  What I'd like to see is preserving the 
heart of that while touching as little as possible. When in doubt, do 
nothing; let the backends suck it up and do the work themselves.


I had to take a health break from community development for a while, and 
I'm hoping to jump back into review again for the rest of the current 
development cycle.  I'll go back to my notes and try to recreate the 
pathological cases that plagued both the 8.3 BGW rewrite and the aborted 
9.2 fsync spreading effort I did; get those running again and see how 
they do on this new approach.  I have a decent sized 24 core server that 
should be good enough for this job.  I'll see what I can do.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Andres Freund
Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
  include $(top_srcdir)/src/backend/common.mk
 diff --git a/src/backend/postmaster/bgreclaimer.c 
 b/src/backend/postmaster/bgreclaimer.c
 new file mode 100644
 index 000..3df2337
 --- /dev/null
 +++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

 @@ -0,0 +1,302 @@
 +/*-
 + *
 + * bgreclaimer.c
 + *
 + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
 + * attempts to keep regular backends from having to run clock sweep (which
 + * they would only do when they don't find a usable shared buffer from
 + * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in. I'd phrase it as which
they only need to do if they don't find a victim buffer from the
freelist

  In the best scenario all requests
 + * for shared buffers will be fulfilled from freelist as the background
 + * reclaimer process always tries to maintain buffers on freelist.  However,
 + * regular backends are still empowered to run clock sweep to find a usable
 + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

empowered sounds strange to me. 'still can run the clock sweep'?

 + * The bgwriter is started by the postmaster as soon as the startup 
 subprocess
 + * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just
during InArchiveRecovery recovery but also StandbyMode? But I don't see
why we shouldn't use it during normal crash recovery. That's also often
painfully slow and the reclaimer could help? Less, but still.


 +static void bgreclaim_quickdie(SIGNAL_ARGS);
 +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
 +static void ReqShutdownHandler(SIGNAL_ARGS);
 +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

 + /*
 +  * If an exception is encountered, processing resumes here.
 +  *
 +  * See notes in postgres.c about the design of this coding.
 +  */
 + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 + {
 + /* Since not using PG_TRY, must reset error stack by hand */
 + error_context_stack = NULL;
 +
 + /* Prevent interrupts while cleaning up */
 + HOLD_INTERRUPTS();
 +
 + /* Report the error to the server log */
 + EmitErrorReport();
 +
 + /*
 +  * These operations are really just a minimal subset of
 +  * AbortTransaction().  We don't have very many resources to 
 worry
 +  * about in bgreclaim, but we do have buffers and file 
 descriptors.
 +  */
 + UnlockBuffers();
 + AtEOXact_Buffers(false);
 + AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

 + /*
 +  * Now return to normal top-level context and clear 
 ErrorContext for
 +  * next time.
 +  */
 + MemoryContextSwitchTo(bgreclaim_context);
 + FlushErrorState();
 +
 + /* Flush any leaked data in the top-level context */
 + MemoryContextResetAndDeleteChildren(bgreclaim_context);
 +
 + /* Now we can allow interrupts again */
 + RESUME_INTERRUPTS();

Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:
/*
 * Sleep at least 1 second after any error.  A write error is 
likely
 * to be repeated, and we don't want to be filling the error 
logs as
 * fast as we can.
 */
pg_usleep(100L);


 + /*
 +  * Loop forever
 +  */
 + for (;;)
 + {
 + int rc;
 +
 +
 + /*
 +  * Backend will signal bgreclaimer when the number of buffers in
 +  * freelist falls below than low water mark of freelist.
 +  */
  + rc = WaitLatch(&MyProc->procLatch,
 +WL_LATCH_SET | WL_POSTMASTER_DEATH,
 +-1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit). Maybe it should rather be

ResetLatch();
BgMoveBuffersToFreelist();
pgstat_send_bgwriter();
rc = WaitLatch()
if (rc & WL_POSTMASTER_DEATH)
   exit(1)

 +Background Reclaimer's Processing
 +-
 +
 +The background reclaimer is designed to move buffers to freelist that are
 +likely to be recycled soon, thereby offloading the need to perform
 +clock sweep work from active 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Robert Haas
Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 +static void bgreclaim_quickdie(SIGNAL_ARGS);
 +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
 +static void ReqShutdownHandler(SIGNAL_ARGS);
 +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

 This looks inconsistent.

It's exactly the same as what bgwriter.c does.

 No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
 good idea, regardless of it possibly being true today (which I'm not
 sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes.  The current state of affairs is really pretty
helter-skelter.  But for this patch, I think we should aim to mimic
the existing style, as ugly as it is.  I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

 +Two water mark indicators are used to maintain sufficient number of buffers
 +on freelist.  Low water mark indicator is used by backends to wake 
 bgreclaimer
 +when the number of buffers in freelist falls below it.  High water mark
 +indicator is used by bgreclaimer to move buffers to freelist.

 For me the description of the high water as stated here doesn't seem to
 explain anything.

Yeah, let me try to revise and expand on that a bit:

Background Reclaimer's Processing
-

The background reclaimer runs the clock sweep to identify buffers that
are good candidates for eviction and puts them on the freelist.  This
makes buffer allocation much faster, since removing a buffer from the
head of a linked list is much cheaper than linearly scanning the whole
buffer pool until a promising candidate is found.  It's possible that
a buffer we add to the freelist may be accessed or even pinned before
it's evicted; if that happens, the backend that would have evicted it
will simply disregard it and take the next buffer instead (or run the
clock sweep itself, if necessary).  However, to make sure that doesn't
happen too often, we need to keep the freelist as short as possible,
so that there won't be many other buffer accesses between the time a
buffer is added to the freelist and the time when it's actually
evicted.

We use two water marks to control the activity of the bgreclaimer
process.  Each time bgreclaimer is awoken, it will move buffers to the
freelist until the length of the free list reaches the high water
mark.  It will then sleep.  When the number of buffers on the freelist
reaches the low water mark, backends attempting to allocate new
buffers will set the bgreclaimer's latch, waking it up again.  While
it's important for the high water mark to be small (for the reasons
described above), we also need to ensure adequate separation between
the low and high water marks, so that the bgreclaimer isn't constantly
being awoken to find just a handful of additional candidate buffers,
and we need to ensure that the low watermark is adequate to keep the
freelist from becoming completely empty before bgreclaimer has time to
wake up and begin filling it again.
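
(A rough illustration of that interaction; the names, locking details, and
thresholds below are illustrative rather than taken from the patch:)

    /* Backend side, popping a buffer inside StrategyGetBuffer(): */
    SpinLockAcquire(&StrategyControl->freelist_lck);
    /* ... take the head of the freelist ... */
    StrategyControl->numFreeListBuffers--;
    if (StrategyControl->numFreeListBuffers < lowWaterMark)
        SetLatch(StrategyControl->bgreclaimerLatch);  /* wake bgreclaimer */
    SpinLockRelease(&StrategyControl->freelist_lck);

    /* bgreclaimer side, once awoken: */
    while (StrategyControl->numFreeListBuffers < highWaterMark)
    {
        /* run the clock sweep and move one candidate to the freelist */
    }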

 This section should have a description of how the reclaimer interacts
 with the bgwriter logic. Do we put dirty buffers on the freelist that
 are then cleaned by the bgwriter? Which buffers does the bgwriter write
 out?

The bgwriter is cleaning ahead of the strategy point, and the
bgreclaimer is advancing the strategy point.  I think we should
consider having the bgreclaimer wake the bgwriter if it comes across a
dirty buffer, because while the bgwriter only estimates the rate of
buffer allocation, bgreclaimer *knows* the rate of allocation, because
its own activity is tied to the allocation rate.  I think there's the
potential for this kind of thing to make the background writer
significantly more effective than it is today, but I'm heavily in
favor of leaving it for a separate patch.
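
(One possible shape of that idea, sketched only to make the suggestion
concrete; BM_DIRTY is the existing dirty flag, while the latch handle here
is an assumption:)

    /* Illustrative: while bgreclaimer moves a candidate to the freelist,
     * nudge the bgwriter if the candidate turns out to be dirty. */
    if (buf->flags & BM_DIRTY)
        SetLatch(StrategyControl->bgwriterLatch);   /* assumed handle */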

 I wonder if we don't want to increase the high watermark when
 tmp_recent_backend_clocksweep  0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

 Hm. Perhaps we should do a bufHdr->refcount != zero check without
 locking here? The atomic op will transfer the cacheline exclusively to
 the reclaimer's CPU. Even though it very shortly afterwards will be
 touched afterwards by the pinning backend.

Meh.  I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

 * Are we sure that the freelist_lck spinlock won't cause pain? Right now
   there will possibly be dozens of processes busily spinning on it... I
   think it's an acceptable risk, but we should think about it.

As you and I have talked about before, we could reduce contention here
by partitioning the freelist, or by using a CAS loop to pop items off
of it.  But I am not convinced either 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Amit Kapila
On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas robertmh...@gmail.com wrote:

 Thanks for reviewing, Andres.

 On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com
wrote:
  +static void bgreclaim_quickdie(SIGNAL_ARGS);
  +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
  +static void ReqShutdownHandler(SIGNAL_ARGS);
  +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
 
  This looks inconsistent.

 It's exactly the same as what bgwriter.c does.

  No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
  good idea, regardless of it possibly being true today (which I'm not
  sure about yet).

 We really need a more centralized way to handle error cleanup in
 auxiliary processes.  The current state of affairs is really pretty
 helter-skelter.  But for this patch, I think we should aim to mimic
 the existing style, as ugly as it is.  I'm not sure whether Amit's got
 the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
 is probably a good idea.

Code related to the bgreclaimer logic itself doesn't take any LWLock; do
you suspect the same might be required due to some signal/interrupt
handling?

From my side, I have thought about what to keep for error cleanup based
on the working of bgreclaimer.  However, there is a chance that I have
missed something.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Andres Freund
On 2014-09-11 09:02:34 -0400, Robert Haas wrote:
 Thanks for reviewing, Andres.
 
 On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund and...@2ndquadrant.com wrote:
  +static void bgreclaim_quickdie(SIGNAL_ARGS);
  +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
  +static void ReqShutdownHandler(SIGNAL_ARGS);
  +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
 
  This looks inconsistent.
 
 It's exactly the same as what bgwriter.c does.

So what? There's no code in common, so I see no reason to have one
signal handler using underscores and the next one camelcase names.

  No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
  good idea, regardless of it possibly being true today (which I'm not
  sure about yet).
 
 We really need a more centralized way to handle error cleanup in
 auxiliary processes.  The current state of affairs is really pretty
 helter-skelter.

Agreed. There really should be three variants:
* full abort including support for transactions
* full abort without transactions being used (most background processes)
* abort without shared memory interactions

 But for this patch, I think we should aim to mimic
 the existing style, as ugly as it is.

Agreed.

 Background Reclaimer's Processing
 -
 
 The background reclaimer runs the clock sweep to identify buffers that
 are good candidates for eviction and puts them on the freelist.  This
 makes buffer allocation much faster, since removing a buffer from the
 head of a linked list is much cheaper than linearly scanning the whole
 buffer pool until a promising candidate is found.  It's possible that
 a buffer we add to the freelist may be accessed or even pinned before
 it's evicted; if that happens, the backend that would have evicted it
 will simply disregard it and take the next buffer instead (or run the
 clock sweep itself, if necessary).  However, to make sure that doesn't
 happen too often, we need to keep the freelist as short as possible,
 so that there won't be many other buffer accesses between when the
 time a buffer is added to the freelist and the time when it's actually
 evicted.
 
 We use two water marks to control the activity of the bgreclaimer
 process.  Each time bgreclaimer is awoken, it will move buffers to the
 freelist until the length of the free list reaches the high water
 mark.  It will then sleep.

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...
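
To make the water-mark behaviour described above concrete, here is a
minimal C sketch of the reclaim loop, assuming 9.4-era buffer headers.
It is illustrative only: FreelistLengthSketch(), HighWaterMarkSketch()
and ClockSweepNextBufferSketch() are placeholder helpers rather than the
patch's actual identifiers, StrategyMoveBufferToFreeListEnd() is the
patch's freelist-append helper named earlier in this thread, and
error/shutdown handling is omitted.

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/proc.h"

    static void
    BgReclaimerLoopSketch(void)
    {
        for (;;)
        {
            /* Fill the freelist up to the high water mark. */
            while (FreelistLengthSketch() < HighWaterMarkSketch())
            {
                volatile BufferDesc *buf = ClockSweepNextBufferSketch();

                LockBufHdr(buf);
                if (buf->refcount == 0 && buf->usage_count == 0)
                {
                    /* unpinned and cold: a good candidate for eviction */
                    UnlockBufHdr(buf);
                    StrategyMoveBufferToFreeListEnd(buf);
                }
                else
                {
                    /* otherwise age it, exactly as the clock sweep would */
                    if (buf->usage_count > 0)
                        buf->usage_count--;
                    UnlockBufHdr(buf);
                }
            }

            /* Sleep until a backend sees the list drop below the low mark. */
            WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1L);
            ResetLatch(&MyProc->procLatch);
        }
    }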

  This section should have a description of how the reclaimer interacts
  with the bgwriter logic. Do we put dirty buffers on the freelist that
  are then cleaned by the bgwriter? Which buffers does the bgwriter write
  out?
 
 The bgwriter is cleaning ahead of the strategy point, and the
 bgreclaimer is advancing the strategy point.

That sentence, in some form, should be in the above paragraph.

 I think we should
 consider having the bgreclaimer wake the bgwriter if it comes across a
 dirty buffer, because while the bgwriter only estimates the rate of
 buffer allocation, bgreclaimer *knows* the rate of allocation, because
 its own activity is tied to the allocation rate.  I think there's the
 potential for this kind of thing to make the background writer
 significantly more effective than it is today, but I'm heavily in
 favor of leaving it for a separate patch.

Yes, doing that sounds like a good plan. I'm happy with that being done
in a separate patch.
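
As a rough illustration of the deferred idea (bgreclaimer nudging the
bgwriter when its sweep crosses a dirty buffer), the check could look
something like the sketch below.  The bgwriterLatch pointer follows the
existing StrategyNotifyBgWriter() mechanism; treat the exact field access
as an assumption rather than the patch's code.

    static void
    MaybeWakeBgwriterSketch(volatile BufferDesc *buf)
    {
        bool    dirty;

        LockBufHdr(buf);
        dirty = (buf->flags & BM_DIRTY) != 0;
        UnlockBufHdr(buf);

        /*
         * Only a hint: waking bgwriter early lets it clean ahead of the
         * strategy point while bgreclaimer keeps advancing it.
         */
        if (dirty && StrategyControl->bgwriterLatch != NULL)
            SetLatch(StrategyControl->bgwriterLatch);
    }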

  I wonder if we don't want to increase the high watermark when
  tmp_recent_backend_clocksweep > 0?
 
 That doesn't really work unless there's some countervailing force to
 eventually reduce it again; otherwise, it'd just converge to infinity.
 And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptical about
untunable, fixed numbers proving to be accurate for widely varied
workloads.

  Hm. Perhaps we should do a bufHdr->refcount != zero check without
  locking here? The atomic op will transfer the cacheline exclusively to
  the reclaimer's CPU. Even though it very shortly afterwards will be
  touched afterwards by the pinning backend.
 
 Meh.  I'm not in favor of adding more funny games with locking unless
 we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only one process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

  * Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's an acceptable risk, but we should think about it.
 
 As you and I have talked 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Andres Freund
  We really need a more centralized way to handle error cleanup in
  auxiliary processes.  The current state of affairs is really pretty
  helter-skelter.  But for this patch, I think we should aim to mimic
  the existing style, as ugly as it is.  I'm not sure whether Amit's got
  the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
  is probably a good idea.
 
 Code related to bgreclaimer logic itself doesn't take any LWLock, do
 you suspect the same might be required due to some Signal/Interrupt
 handling?

I suspect it might creep in at some point at some unrelated place. Which
will only ever break in production scenarios. Say, an lwlock in config
file processing. I seem to recall somebody seeing a version of a patch
adding an lwlock there... :). Or a logging hook. Or ...

The savings from not doing LWLockReleaseAll() are nonexistent, so ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Amit Kapila
On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund and...@2ndquadrant.com
wrote:

   We really need a more centralized way to handle error cleanup in
   auxiliary processes.  The current state of affairs is really pretty
   helter-skelter.  But for this patch, I think we should aim to mimic
   the existing style, as ugly as it is.  I'm not sure whether Amit's got
   the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
   is probably a good idea.
 
  Code related to bgreclaimer logic itself doesn't take any LWLock, do
  you suspect the same might be required due to some Signal/Interrupt
  handling?

 I suspect it might creep in at some point at some unrelated place. Which
 will only ever break in production scenarios. Say, an lwlock in config
 file processing.

Yeah, I suspected the same and checked that path, but couldn't find any;
maybe it is there in some path, as the code has many flows.

 I seem to recall somebody seeing a version of a patch
 adding an lwlock there... :). Or a logging hook. Or ...

 The savings from not doing LWLockReleaseAll() are nonexistent, so ...

Okay, I shall add it in next version of patch and mention in comments
the reasons quoted by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Robert Haas
On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote:
 It's exactly the same as what bgwriter.c does.

 So what? There's no code in common, so I see no reason to have one
 signal handler using underscores and the next one camelcase names.

/me shrugs.

It's not always possible to have things be consistent with each other
within a file and also with what gets done in other files.  I'm not
sure we should fault patch authors for choosing a different one than
we would have chosen.  FWIW, I probably would have done it the way
Amit did it.  I don't actually care, though.

 I wonder if we should recheck the number of freelist items before
 sleeping. As the latch currently is reset before sleeping (IIRC) we
 might miss being woken up soon. It very well might be that bgreclaim
 needs to run for more than one cycle in a row to keep up...

The outer loop in BgMoveBuffersToFreelist() was added to address
precisely this point, which I raised in a previous review.

  I wonder if we don't want to increase the high watermark when
  tmp_recent_backend_clocksweep > 0?

 That doesn't really work unless there's some countervailing force to
 eventually reduce it again; otherwise, it'd just converge to infinity.
 And it doesn't really seem necessary at the moment.

 Right, it obviously needs to go both ways. I'm a bit sceptical about
 untunable, fixed numbers proving to be accurate for widely varied
 workloads.

Me, too, but I'm *even more* skeptical about making things complicated
on the pure theory that a simple solution can't be correct.  I'm not
blind to the possibility that the current logic is inadequate, but
testing proves that it works well enough to produce a massive
performance boost over where we are now.  When, and if, we develop a
theory about specifically how it falls short then, sure, let's adjust
it.  But I think it would be a serious error to try to engineer a
perfect algorithm here based on the amount of testing that we can
reasonably do pre-commit.  We have no chance of getting that right,
and I'd rather have an algorithm that is simple and imperfect than an
algorithm that is complex and still imperfect.  No matter what, it's
better than what we have now.

  Hm. Perhaps we should do a bufHdr->refcount != zero check without
  locking here? The atomic op will transfer the cacheline exclusively to
  the reclaimer's CPU. Even though it very shortly afterwards will be
  touched afterwards by the pinning backend.

 Meh.  I'm not in favor of adding more funny games with locking unless
 we can prove they're necessary for performance.

 Well, this in theory increases the number of processes touching buffer
 headers regularly. Currently, if you have one read IO intensive backend,
 there's pretty much only one process touching the cachelines. This will make
 it two. I don't think it's unreasonable to try to reduce the cacheline
 pingpong caused by that...

It's not unreasonable, but this is a good place to apply Knuth's first
law of optimization.  There's no proof we need to do this, so let's
not until there is.

 I also think it would be good to
 get some statistics on how often regular backends are running the
 clocksweep vs. how often bgreclaimer is satisfying their needs.

 I think that's necessary. The patch added buf_backend_clocksweep. Maybe
 we just also need buf_backend_from_freelist?

That's just (or should be just) buf_alloc - buf_backend_clocksweep.

I think buf_backend_clocksweep should really be called
buf_alloc_clocksweep, and should be added (in all relevant places)
right next to buf_alloc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Andres Freund
On 2014-09-11 09:48:10 -0400, Robert Haas wrote:
 On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com wrote:
  I wonder if we should recheck the number of freelist items before
  sleeping. As the latch currently is reset before sleeping (IIRC) we
  might miss being woken up soon. It very well might be that bgreclaim
  needs to run for more than one cycle in a row to keep up...
 
 The outer loop in BgMoveBuffersToFreelist() was added to address
 precisely this point, which I raised in a previous review.

Hm, right. But then let's move BgWriterStats.m_buf_alloc +=,
... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
up being continuously busy without being visible.

   I wonder if we don't want to increase the high watermark when
   tmp_recent_backend_clocksweep > 0?
 
  That doesn't really work unless there's some countervailing force to
  eventually reduce it again; otherwise, it'd just converge to infinity.
  And it doesn't really seem necessary at the moment.
 
  Right, it obviously needs to go both ways. I'm a bit sceptical about
  untunable, fixed numbers proving to be accurate for widely varied
  workloads.
 
 Me, too, but I'm *even more* skeptical about making things complicated
 on the pure theory that a simple solution can't be correct.

Fair enough.

 I'm not
 blind to the possibility that the current logic is inadequate, but
 testing proves that it works well enough to produce a massive
 performance boost over where we are now.

But, to be honest, the testing so far was pretty narrow in the kind of
workloads that were run, if I cross-read things accurately. Don't get me
wrong, I'm *really* happy about having this patch, that just doesn't
mean every detail is right ;)

   Hm. Perhaps we should do a bufHdr->refcount != zero check without
   locking here? The atomic op will transfer the cacheline exclusively to
   the reclaimer's CPU. Even though it very shortly afterwards will be
   touched afterwards by the pinning backend.
 
  Meh.  I'm not in favor of adding more funny games with locking unless
  we can prove they're necessary for performance.
 
  Well, this in theory increases the number of processes touching buffer
  headers regularly. Currently, if you have one read IO intensive backend,
  there's pretty much only process touching the cachelines. This will make
  it two. I don't think it's unreasonable to try to reduce the cacheline
  pingpong caused by that...
 
 It's not unreasonable, but this is a good place to apply Knuth's first
 law of optimization.  There's no proof we need to do this, so let's
 not until there is.

That's true for new (pieces of) software; less so when working with an
installed base that you might regress... But whatever.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-11 Thread Robert Haas
On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-11 09:48:10 -0400, Robert Haas wrote:
 On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I wonder if we should recheck the number of freelist items before
  sleeping. As the latch currently is reset before sleeping (IIRC) we
  might miss being woken up soon. It very well might be that bgreclaim
  needs to run for more than one cycle in a row to keep up...

 The outer loop in BgMoveBuffersToFreelist() was added to address
 precisely this point, which I raised in a previous review.

 Hm, right. But then let's move BgWriterStats.m_buf_alloc +=,
 ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
 up being continuously busy without being visible.

Good idea.
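
The loop shape being agreed on here, sketched in C for readers following
along.  Helper names carrying a Sketch suffix are illustrative; the point
is only that the counters are forwarded once per outer pass rather than
once per wakeup, so a continuously busy bgreclaimer stays visible in
pg_stat_bgwriter.

    for (;;)
    {
        int     num_to_free;
        uint32  recent_alloc;

        /*
         * How far below the high water mark are we, and how many
         * allocations did backends do meanwhile?  (placeholder helper)
         */
        GetFreelistDeficitSketch(&num_to_free, &recent_alloc);
        if (num_to_free <= 0)
            break;              /* at the high water mark: go back to sleep */

        while (num_to_free > 0)
        {
            if (TryMoveOneBufferToFreelistSketch())
                num_to_free--;
        }

        /* Forward the stats every pass, not just before sleeping. */
        BgWriterStats.m_buf_alloc += recent_alloc;
        pgstat_send_bgwriter();
    }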

 I'm not
 blind to the possibility that the current logic is inadequate, but
 testing proves that it works well enough to produce a massive
 performance boost over where we are now.

 But, to be honest, the testing so far was pretty narrow in the kind of
  workloads that were run, if I cross-read things accurately. Don't get me
 wrong, I'm *really* happy about having this patch, that just doesn't
 mean every detail is right ;)

Oh, sure.  Totally agreed.  And, to the extent that we're improving
things based on actual testing, I'm A-OK with that.  I just don't want
to start speculating, or we'll never get this thing off the ground.

Some possibly-interesting test cases would be:

(1) A read-only pgbench workload that is just a tiny bit larger than
shared_buffers, say size of shared_buffers plus 0.01%.  Such workloads
tend to stress buffer eviction heavily.

(2) A workload that maximizes the rate of concurrent buffer eviction
relative to other tasks.  Read-only pgbench is not bad for this, but
maybe somebody's got a better idea.

As I sort of mentioned in what I was writing for the bufmgr README,
there are, more or less, three ways this can fall down, at least that
I can see: (1) if the high water mark is too high, then we'll start
finding buffers in the freelist that have already been touched since
we added them: (2) if the low water mark is too low, the freelist will
run dry; and (3) if the low and high water marks are too close
together, the bgreclaimer will be constantly getting woken up and
going to sleep again.  I can't personally think of a workload that
will enable us to get a better handle on those cases than
high-concurrency pgbench, but you're known to be ingenious at coming
up with destructive workloads, so if you have an idea, by all means
fire away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-10 Thread Amit Kapila
On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Apart from above, I think for this patch, cat version bump is required
  as I have modified system catalog.  However I have not done the
  same in the patch as otherwise it will be a bit difficult to take performance
  data.
 
  One regression failed on linux due to spacing issue which is
  fixed in attached patch.

 I took another read through this patch.  Here are some further review
comments:

 1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
 both num_to_free and tmp_num_to_free.


num_to_free is used to accumulate the total number of buffers freed
in one cycle of BgMoveBuffersToFreelist(), which is reported
as debug info (BGW_DEBUG), and tmp_num_to_free is a temporary
count of the number of buffers to be freed in one sub-cycle
(the inner while loop).


 I'd get rid of tmp_num_to_free
 and move the declaration of num_to_free inside the outer loop.  I'd
 also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
 tmp_recent_backend_clocksweep into the innermost scope in which they
 are used.

okay, I have moved the tmp_* variables in innermost scope.


 2. Also in that function, I think the innermost bit of logic could be
 rewritten more compactly, and in such a way as to make it clearer for
 what set of instructions the buffer header will be locked.
 LockBufHdr(bufHdr);
 if (bufHdr->refcount == 0)
 {
     if (bufHdr->usage_count > 0)
         bufHdr->usage_count--;
     else
         add_to_freelist = true;
 }
 UnlockBufHdr(bufHdr);
 if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
     num_to_free--;

Changed as per suggestion.

 3. This comment is now obsolete:

 +   /*
 +    * If bgwriterLatch is set, we need to waken the bgwriter, but we should
 +    * not do so while holding freelist_lck; so set it after releasing the
 +    * freelist_lck.  This is annoyingly tedious, but it happens at most once
 +    * per bgwriter cycle, so the performance hit is minimal.
 +    */
 +

 We're not actually holding any lock in need of releasing at that point
 in the code, so this can be shortened to If bgwriterLatch is set, we
 need to waken the bgwriter.

Changed as per suggestion.


  * Ideally numFreeListBuffers should get called under freelist
spinlock,

 That doesn't make any sense.  numFreeListBuffers is a variable, so you
 can't call it.  The value should be *read* under the spinlock, but
 it is.  I think this whole comment can be deleted and replaced with
 If the number of free buffers has fallen below the low water mark,
 awaken the bgreclaimer to repopulate it.

Changed as per suggestion.

 4. StrategySyncStartAndEnd() is kind of a mess.  One, it can return
 the same victim buffer that's being handed out - at almost the same
 time - to a backend running the clock sweep; if it does, they'll fight
 over the buffer.  Two, the *end out parameter actually returns a
 count, not an endpoint.  I think we should have
 BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
 top of the inner loop rather than the bottom, and change
 StrategySyncStartAndEnd() so that it knows nothing about
 victimbuf_lck.  Let's also change StrategyGetBuffer() to call
 StrategySyncNextVictimBuffer() so that the logic is centralized in one
 place, and rename StrategySyncStartAndEnd() to something that better
 matches its revised purpose.

Changed as per suggestion.  I have also updated
StrategySyncNextVictimBuffer() such that it increments completePasses
on completion of a cycle, as I think it is appropriate to update it even when
the clock sweep is done by bgreclaimer.
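
For readers following along, the centralized step being described (advance
the clock hand under victimbuf_lck and bump completePasses when the hand
wraps) amounts to something like the sketch below; the exact layout of
StrategyControl in the patch may differ.

    static volatile BufferDesc *
    StrategySyncNextVictimBufferSketch(void)
    {
        int     victim;

        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        victim = StrategyControl->nextVictimBuffer;
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            /* the hand wrapped: one complete pass over the buffer pool */
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        SpinLockRelease(&StrategyControl->victimbuf_lck);

        return &BufferDescriptors[victim];
    }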


 Maybe StrategyGetReclaimInfo().

I have changed it to StrategyGetFreelistAccessInfo(), as most
other functions in freelist.c use names that relate
to buffers.


 5. Have you tested that this new bgwriter statistic is actually
 working?  Because it looks to me like BgMoveBuffersToFreelist is
 changing BgWriterStats but never calling pgstat_send_bgwriter(), which
 I'm thinking will mean the counters accumulate forever inside the
 reclaimer but never get forwarded to the stats collector.

pgstat_send_bgwriter() is called in the bgreclaimer loop (the caller of
BgMoveBuffersToFreelist; this is similar to what bgwriter does).
I have done a few tests with it before sending the previous patch.


 6. StrategyInitialize() uses #defines for
 HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
 LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
 for clamping.  Let's have constants for all of those (and omit mention
 of the specific values in the comments).

Changed as per suggestion.
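
For context, the constants and clamping discussed in point 6 amount to
roughly the following.  The percentage and clamp values follow the
defaults mentioned elsewhere in this thread (high water mark about 0.5%
of shared buffers, clamped to the 5..2000 range; low water mark 20% of
the high one); names other than the two *_PERCENT defines quoted above
are invented for this sketch.

    #define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT   0.5
    #define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT    20   /* % of high mark */
    #define MIN_HIGH_WATER_MARK                        5
    #define MAX_HIGH_WATER_MARK                        2000

    static void
    ComputeWaterMarksSketch(int nbuffers, int *high_mark, int *low_mark)
    {
        int     hm;

        hm = (int) (nbuffers * HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT / 100.0);
        if (hm < MIN_HIGH_WATER_MARK)
            hm = MIN_HIGH_WATER_MARK;
        if (hm > MAX_HIGH_WATER_MARK)
            hm = MAX_HIGH_WATER_MARK;

        *high_mark = hm;
        *low_mark = hm * LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT / 100;
    }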

 7. The line you've added to the definition of view pg_stat_bgwriter
 doesn't seem to be indented the same way as all the others.  Tab vs.
 space problem?

Fixed.

Performance Data:

Re: [HACKERS] Scaling shared buffer eviction

2014-09-10 Thread Amit Kapila
On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood 
mark.kirkw...@catalyst.net.nz wrote:

 On 05/09/14 23:50, Amit Kapila wrote:

 On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
   FWIW below are some test results on the 60 core beast with this patch
 applied to 9.4. I'll need to do more runs to iron out the variation,
   but it looks like the patch helps the standard (write heavy) pgbench
 workload a little, and clearly helps the read only case.
  

 Thanks for doing the test.  I think if possible you can take
 the performance data with higher scale factor (4000) as it
 seems your m/c has 1TB of RAM.  You might want to use
 latest patch I have posted today.


 Here's some fairly typical data from read-write and read-only runs at
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not
seeing much variation between repeated read-write runs with the same config
(which is nice - sleep 30 and explicit checkpoint call between each one
seem to help there).

 Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be
better at higher client counts than beta1 was...

 In terms of the effect of the patch - looks pretty similar to the scale
2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even better
perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting; the benefit of this patch
will be greater at higher client counts when there is large data
and all the data can fit in RAM.

Many thanks for doing the performance test for patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-10 Thread Mark Kirkwood

On 10/09/14 18:54, Amit Kapila wrote:

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz
wrote:
 
  In terms of the effect of the patch - looks pretty similar to the
scale 2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting; the benefit of this patch
will be greater at higher client counts when there is large data
and all the data can fit in RAM.

Many thanks for doing the performance test for patch.




No worries, this is looking like a patch we're going to apply to 9.4 to 
make the 60 core beast scale a bit better, so thanks very much for your 
work in this area.


If you would like more tests run at higher scales let me know (we have 
two of these machines at pre-production state currently so I can run 
benchmarks as reqd)!


Regards

Mark





Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Amit Kapila
On Tue, Sep 9, 2014 at 3:11 AM, Thom Brown t...@linux.com wrote:
 On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 
  Apart from above, I think for this patch, cat version bump is required
  as I have modified system catalog.  However I have not done the
  same in the patch as otherwise it will be a bit difficult to take performance
  data.

 One regression failed on linux due to spacing issue which is
 fixed in attached patch.


 Here's a set of test results against this patch:

Many thanks for taking the performance data.  This data clearly shows
that there is a performance improvement even at a lower configuration;
however, the real benefit of the patch can be seen with a higher-core
m/c and with larger RAM (enough to contain all the data).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Mark Kirkwood

On 05/09/14 23:50, Amit Kapila wrote:

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz
wrote:
 
  On 04/09/14 14:42, Amit Kapila wrote:
 
  On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz mailto:mark.kirkw...@catalyst.net.nz
  wrote:
 
 
 
  Hi Amit,
 
  Results look pretty good. Does it help in the read-write case too?
 
 
  Last time I ran the tpc-b test of pgbench (results of which are
  posted earlier in this thread), there doesn't seem to be any major
  gain for that, however for cases where read is predominant, you
  might see better gains.
 
  I am again planing to take that data in next few days.
 
 
  FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation,
  but it looks like the patch helps the standard (write heavy) pgbench
workload a little, and clearly helps the read only case.
 

Thanks for doing the test.  I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM.  You might want to use
latest patch I have posted today.



Here's some fairly typical data from read-write and read-only runs at 
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not 
seeing much variation between repeated read-write runs with the same 
config (which is nice - sleep 30 and explicit checkpoint call between 
each one seem to help there).


Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be 
better at higher client counts than beta1 was...


In terms of the effect of the patch - looks pretty similar to the scale 
2000 results for read-write, but read-only is a different and more 
interesting story - unpatched 9.4 is noticeably impacted in the higher 
client counts, whereas the patched version scales as well (or even 
better perhaps) than in the scale 2000 case.


read write (600s)

Clients  | tps| tps (unpatched)
-++
  6  |   9395 |   9334
  12 |  16605 |  16525
  24 |  24634 |  24910
  48 |  32170 |  31275
  96 |  35675 |  36533
 192 |  35579 |  31137
 384 |  30528 |  28308


read only (300s)

Clients  | tps| tps (unpatched)
-++
  6  |  35743 |   35362
  12 | 111019 |  106579
  24 | 199746 |  160305
  48 | 327026 |  198407
  96 | 379184 |  171863
 192 | 356623 |  152224
 384 | 340878 |  128308


regards

Mark







Re: [HACKERS] Scaling shared buffer eviction

2014-09-09 Thread Michael Paquier
On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 One regression failed on linux due to spacing issue which is
 fixed in attached patch.
I just read the latest patch out of curiosity; wouldn't it make more sense
to split the patch into two different patches for clarity: one for the
reclaimer worker centralized around BgMoveBuffersToFreelist and a
second for the pg_stat portion? Those seem like two different features.
Regards,
-- 
Michael




Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Merlin Moncure
On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Client Count/Patch_Ver (tps) |     8 |     16 |     32 |     64 |    128
 HEAD                         | 58614 | 107370 | 140717 | 104357 |  65010
 Patch                        | 60092 | 113564 | 165014 | 213848 | 216065

 This data is median of 3 runs, detailed report is attached with mail.
 I have not repeated the test for all configurations, as there is no
 major change in design/algorithm which can affect performance.
 Mark has already taken tpc-b data which ensures that there is
 no problem with it, however I will also take it once with latest version.

Well, these numbers are pretty much amazing.  Question: It seems
there's obviously quite a bit of contention left;  do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

merlin




Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Robert Haas
On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 Apart from above, I think for this patch, cat version bump is required
 as I have modified system catalog.  However I have not done the
 same in the patch as otherwise it will be a bit difficult to take performance
 data.

 One regression failed on linux due to spacing issue which is
 fixed in attached patch.

I took another read through this patch.  Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free.  I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop.  I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked.
LockBufHdr(bufHdr);
if (bufHdr->refcount == 0)
{
    if (bufHdr->usage_count > 0)
        bufHdr->usage_count--;
    else
        add_to_freelist = true;
}
UnlockBufHdr(bufHdr);
if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
    num_to_free--;

3. This comment is now obsolete:

+   /*
+    * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+    * not do so while holding freelist_lck; so set it after releasing the
+    * freelist_lck.  This is annoyingly tedious, but it happens at most once
+    * per bgwriter cycle, so the performance hit is minimal.
+    */
+

We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to If bgwriterLatch is set, we
need to waken the bgwriter.

 * Ideally numFreeListBuffers should get called under freelist spinlock,

That doesn't make any sense.  numFreeListBuffers is a variable, so you
can't call it.  The value should be *read* under the spinlock, but
it is.  I think this whole comment can be deleted and replaced with
If the number of free buffers has fallen below the low water mark,
awaken the bgreclaimer to repopulate it.

4. StrategySyncStartAndEnd() is kind of a mess.  One, it can return
the same victim buffer that's being handed out - at almost the same
time - to a backend running the clock sweep; if it does, they'll fight
over the buffer.  Two, the *end out parameter actually returns a
count, not an endpoint.  I think we should have
BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
top of the inner loop rather than the bottom, and change
StrategySyncStartAndEnd() so that it knows nothing about
victimbuf_lck.  Let's also change StrategyGetBuffer() to call
StrategySyncNextVictimBuffer() so that the logic is centralized in one
place, and rename StrategySyncStartAndEnd() to something that better
matches its revised purpose.  Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually
working?  Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping.  Let's have constants for all of those (and omit mention
of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others.  Tab vs.
space problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Thom Brown
On 5 September 2014 14:19, Amit Kapila amit.kapil...@gmail.com wrote:

 On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
 
  Apart from above, I think for this patch, cat version bump is required
  as I have modified system catalog.  However I have not done the
  same in the patch as otherwise it will be a bit difficult to take performance
  data.

 One regression failed on linux due to spacing issue which is
 fixed in attached patch.


Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB

pgbench scale factor = 3000
run duration time = 5min

HEAD
Client Count/No. Of Runs (tps) |     8 |    16 |    32 |    64 |   128
Run-1                          | 31568 | 41890 | 48638 | 49266 | 41845
Run-2                          | 31613 | 41879 | 48597 | 49332 | 41736
Run-3                          | 31674 | 41987 | 48647 | 49320 | 41745


Patch=scalable_buffer_eviction_v7.patch
Client Count/No. Of Runs (tps) |     8 |    16 |    32 |    64 |   128
Run-1                          | 31880 | 42943 | 49359 | 49901 | 43779
Run-2                          | 32150 | 43078 | 48934 | 49828 | 43769
Run-3                          | 32323 | 42894 | 49481 | 49600 | 43529

-- 
Thom


Re: [HACKERS] Scaling shared buffer eviction

2014-09-08 Thread Amit Kapila
On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  Client Count/Patch_Ver (tps) |     8 |     16 |     32 |     64 |    128
  HEAD                         | 58614 | 107370 | 140717 | 104357 |  65010
  Patch                        | 60092 | 113564 | 165014 | 213848 | 216065
 
  This data is median of 3 runs, detailed report is attached with mail.
  I have not repeated the test for all configurations, as there is no
  major change in design/algorithm which can affect performance.
  Mark has already taken tpc-b data which ensures that there is
  no problem with it, however I will also take it once with latest
version.

 Well, these numbers are pretty much amazing.  Question: It seems
 there's obviously quite a bit of contention left;  do you think
 there's still a significant amount of time in the clock sweep, or is
 the primary bottleneck the buffer mapping locks?

I think contention around the clock sweep is very minimal, and for the
buffer mapping locks it has reduced significantly (you can refer
upthread to some of the LWLOCK stat data I posted after this patch),
but maybe we can get more out of it by improving hash table
concurrency.  However, apart from both of the above, the next thing
I have seen in profiles was palloc (at least that is what I remember
from some profiling I did a few months back during the
development of this patch).  That seemed to me at the time a totally
different optimisation, so I left it for another patch.
Another point is that the m/c on which I am doing the performance
tests has 16 cores (64 hardware threads), so I am not sure how much
scaling we can expect.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-05 Thread Amit Kapila
On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I have updated the patch to address the feedback.  Main changes are:
 
  1. For populating freelist, have a separate process (bgreclaimer)
  instead of doing it by bgwriter.
  2. Autotune the low and high threshold values for buffers
  in freelist. I have used the formula as suggested by you upthread.
  3. Cleanup of locking regimen as discussed upthread (completely
  eliminated BufFreelist Lock).
  4. Improved comments and general code cleanup.

 Overall this looks quite promising to me.

 I had thought to call the new process just bgreclaim rather than
 bgreclaimer, but perhaps your name is better after all.  At least,
 it matches what we do elsewhere.  But I don't care for the use
 Bgreclaimer; let's do BgReclaimer if we really need mixed-case, or
 else bgreclaimer.

Changed it to bgreclaimer.


 This is unclear:

 +buffers for replacement.  Earlier to protect freelist, we use LWLOCK as
that
 +is needed to perform clock sweep which is a longer operation, however
now we
 +are using two spinklocks freelist_lck and victimbuf_lck to perform
operations
 +on freelist and run clock sweep respectively.

 I would drop the discussion of what was done before and say something
 like this: The data structures relating to buffer eviction are
 protected by two spinlocks.  freelist_lck protects the freelist and
 related data structures, while victimbuf_lck protects information
 related to the current clock sweep condition.

Changed, but I have not used the exact wording mentioned above; let me know
if the new wording is okay.

 +always in this list.  We also throw buffers into this list if we consider
 +their pages unlikely to be needed soon; this is done by background
process
 +reclaimer.  The list is singly-linked using fields in the

 I suggest: Allocating pages from this list is much cheaper than
 running the clock sweep algorithm, which may encounter many buffers
 that are poor candidates for eviction before finding a good candidate.
 Therefore, we have a background process called bgreclaimer which works
 to keep this list populated.

Changed as per your suggestion.

 +Background Reclaimer's Processing
 +-

 I suggest titling this section Background Reclaim.

 +The background reclaimer is designed to move buffers to freelist that are

 I suggest replacing the first three words of this sentence with
bgreclaimer.

As per discussion in thread, I have kept it as it is.

 +and move the the unpinned and zero usage count buffers to freelist.  It
 +keep on doing this until the number of buffers in freelist become equal
 +high threshold of freelist.

 s/keep/keeps/
 s/become equal/reaches the/
 s/high threshold/high water mark/
 s/of freelist//

Changed as per your suggestion.

 Please change the other places that say threshold to use the water
 mark terminology.

 +if (StrategyMoveBufferToFreeListEnd (bufHdr))

 Extra space.

 + * buffers in consecutive cycles.

 s/consecutive/later/

 +/* Execute the LRU scan */

 s/LRU scan/clock sweep/ ?

Changed as per your suggestion.


 +while (tmp_num_to_free  0)

 I am not sure it's a good idea for this value to be fixed at loop
 start and then just decremented.  Shouldn't we loop and do the whole
 thing over once we reach the high watermark, only stopping when
 StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.

 +/* choose next victim buffer to clean. */

 This process doesn't clean buffers; it puts them on the freelist.

Right. Changed it to match what it does.

 + * high threshold of freelsit), we drasticaly reduce the odds for

 Two typos.

Fixed.

 + * of buffers in freelist fall below low threshold of freelist.

 s/fall/falls/

Changed as per your suggestion.

 In freelist.c, it seems like a poor idea to have two spinlocks as
 consecutive structure members; they'll be in the same cache line,
 leading to false sharing.  If we merge them into a single spinlock,
 does that hurt performance?  If we put them further apart, e.g. by
 moving the freelist_lck to the start of the structure, followed by the
 latches, and leaving victimbuf_lck where it is, does that help
 performance?

As per discussion, I have kept them as it is and added a comment
indicating that we can consider having both locks in separate
cache lines.

 +/*
 + * If the buffer is pinned or has a nonzero usage_count, we cannot use
 + * it; discard it and retry.  (This can only happen if VACUUM put a
 + * valid buffer in the freelist and then someone else used it before
 + * we got to it.  It's probably impossible altogether as of 8.3, but
 + * we'd better check anyway.)
 + */
 +

 This comment is clearly obsolete.

Removed.

  I have not yet added statistics 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-05 Thread Amit Kapila
On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz
wrote:

 On 04/09/14 14:42, Amit Kapila wrote:

 On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood 
mark.kirkw...@catalyst.net.nz
 wrote:



 Hi Amit,

 Results look pretty good. Does it help in the read-write case too?


 Last time I ran the tpc-b test of pgbench (results of which are
 posted earlier in this thread), there doesn't seem to be any major
 gain for that, however for cases where read is predominant, you
 might see better gains.

 I am again planing to take that data in next few days.


 FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation,
 but it looks like the patch helps the standard (write heavy) pgbench
workload a little, and clearly helps the read only case.


Thanks for doing the test.  I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM.  You might want to use
latest patch I have posted today.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Amit Kapila
On Wed, Sep 3, 2014 at 8:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

  +while (tmp_num_to_free  0)
 
  I am not sure it's a good idea for this value to be fixed at loop
  start and then just decremented.
 
  It is based on what bgwriter does for num_to_scan, and
  calling it once has the advantage that we need to take freelist_lck
  just once.

 Right, we shouldn't call it every loop iteration.  However, consider
 this scenario: there are no remaining buffers on the list and the high
 watermark is 2000.  We add 2000 buffers to the list.  But by the time
 we get done, other backends have already done 500 more allocations, so
 now there are only 1500 buffers on the list.  If this should occur, we
 should add an additional 500 buffers to the list before we consider
 sleeping.  We want bgreclaimer to be able to run continuously if the
 demand for buffers is high enough.

It's not difficult to handle such cases, but it can also have a downside
for the cases where demand from backends is not high.
Consider in the above case, if instead of 500 more allocations it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers, and the same can happen again by the time
it moves those 5 buffers.  Another point to keep in mind here is that in this
loop we are reducing the usage_count of buffers as well, in case we don't
find a buffer with usage_count = 0.  OTOH, if we let bgreclaimer go to
sleep after it moves the initially identified buffers, then the backend which
first finds that the number of buffers in the freelist has fallen below the low
water mark can wake bgreclaimer.
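
A sketch of the backend-side behaviour just described: pop a buffer from
the freelist under freelist_lck, skip it if it was touched since
bgreclaimer listed it, and set bgreclaimer's latch once the list drops
below the low water mark.  The bgreclaimerLatch field and
LowWaterMarkSketch() are assumptions for illustration;
firstFreeBuffer/freeNext follow the existing freelist layout.

    static volatile BufferDesc *
    GetBufferFromFreelistSketch(void)
    {
        for (;;)
        {
            volatile BufferDesc *buf;
            bool        below_low_mark;

            SpinLockAcquire(&StrategyControl->freelist_lck);
            if (StrategyControl->firstFreeBuffer < 0)
            {
                SpinLockRelease(&StrategyControl->freelist_lck);
                return NULL;    /* empty: caller runs the clock sweep */
            }

            buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;

            StrategyControl->numFreeListBuffers--;
            below_low_mark =
                (StrategyControl->numFreeListBuffers < LowWaterMarkSketch());
            SpinLockRelease(&StrategyControl->freelist_lck);

            /* Wake bgreclaimer only after releasing the spinlock. */
            if (below_low_mark && StrategyControl->bgreclaimerLatch != NULL)
                SetLatch(StrategyControl->bgreclaimerLatch);

            /* Skip buffers pinned or reused since they were listed. */
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;     /* good victim; header lock still held */
            UnlockBufHdr(buf);
        }
    }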


  In freelist.c, it seems like a poor idea to have two spinlocks as
  consecutive structure members; they'll be in the same cache line,
  leading to false sharing.  If we merge them into a single spinlock,
  does that hurt performance?
 
  I have kept them separate so that backends searching for a buffer
  in the freelist don't contend with bgreclaimer (while doing the clock sweep)
  or with the clock sweep being done by other backends.  I think it will be a bit
  tricky to devise a test where this can hurt; however, it doesn't seem
  too bad to have two separate locks in this case.

 It's not.  But if they are in the same cache line, they will behave
 almost like one lock, because the CPU will lock the entire cache line
 for each atomic op.  See Tom's comments upthread.

I think to avoid having them in the same cache line, we might need to
add some padding (at least 72 bytes) as the structure size including both
the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
I have taken performance data as well by keeping them further apart
as suggested by you upthread and by introducing padding, but the
difference in performance is less than 1.5% (on 64 and 128 client count)
which also might be due to variation of data across runs.  So now to
proceed we have below options:

a. use two spinlocks as in patch, but keep them as far apart as possible.
This might not have an advantage as compared to what is used currently
in the patch, but in future we can add padding to take advantage if
possible (currently on PPC64, it doesn't show any noticeable advantage,
however on some other m/c, it might show the advantage).

b. use only one spinlock; this can have a disadvantage in certain cases
as mentioned upthread, however those might not be usual cases, so for
now we can consider them as lower priority and can choose this option.

Another point in this regard is that I have to make use of volatile
pointer to prevent code rearrangement in this case.
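
One possible layout for option (a), with freelist_lck at the start of the
structure and explicit padding so victimbuf_lck lands on a different cache
line.  The 128-byte figure is the PPC64 cache line size mentioned above;
fields other than the two locks are simplified for this sketch.

    typedef struct BufferStrategyControlSketch
    {
        slock_t     freelist_lck;       /* protects the freelist fields */
        int         firstFreeBuffer;
        int         lastFreeBuffer;
        int         numFreeListBuffers;

        char        pad[128];           /* push victimbuf_lck onto another
                                         * cache line (PPC64 line = 128 bytes) */

        slock_t     victimbuf_lck;      /* protects clock-sweep state */
        int         nextVictimBuffer;
        uint32      completePasses;
    } BufferStrategyControlSketch;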


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Amit Kapila
On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote:


  Performance Data:
  ---
 
  Configuration and Db Details
  IBM POWER-7 16 cores, 64 hardware threads
  RAM = 64GB
  Database Locale =C
  checkpoint_segments=256
  checkpoint_timeout=15min
  scale factor = 3000
  Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
  Duration of each individual run = 5mins
 
  All the data is in tps and taken using pgbench read-only load

 Common configuration remains same as above.

 Shared_Buffers = 500MB
   Client Count/Patch_Ver (tps) |     8 |     16 |     32 |     64 |    128
   HEAD                         | 56248 | 100112 | 121341 |  81128 |  56552
   Patch                        | 59389 | 112483 | 157034 | 185740 | 166725
 ..



 Observations
 -
 1. Performance improvement is up to 2~3 times for higher client
 counts (64, 128).
 2. For lower client count (8), we can see 2~5 % performance
 improvement.
 3. Overall, this improves the read scalability.
 4. For a lower number of shared buffers, we see that there is a minor
 dip in tps even after the patch (it might be that we can improve it by
 tuning the high water mark for the number of buffers on the freelist; I will
 try this by varying the high water mark).


I have taken performance data by varying the high and low water marks
for a lower value of shared buffers, as below:

Shared_buffers = 500MB
Scale_factor = 3000

HM - High water mark, 0.5 means 0.5% of total shared buffers
LM - Low water mark, 20 means 20% of HM.

  Patch_Ver (Data in tps) | Client Count = 128
  HM=0.5;LM=20            | 166725
  HM=1;LM=20              | 166556
  HM=2;LM=30              | 166463
  HM=5;LM=30              | 166107
  HM=10;LM=30             | 167231


Observation

a. There is hardly any difference from varying the high and low water marks
as compared to the default values currently used in the patch.
b. I think this minor dip as compared to the 64 client count is because, first,
this m/c has 64 hardware threads, due to which scaling beyond 64 clients
is difficult, and second, at a relatively lower buffer count (500MB)
there is still minor contention around the Buf Mapping locks.

In general, I think with the patch the scaling is much better (2 times) than
HEAD, even when shared buffers are low and the client count is high,
so this is not an issue.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Robert Haas
On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 It's not difficult to handle such cases, but it can also have a downside
 for the cases where demand from backends is not high.
 Consider in the above case, if instead of 500 more allocations it just
 does 5 more allocations, then bgreclaimer will again have to go through
 the list and move 5 buffers, and the same can happen again by the time
 it moves those 5 buffers.

That's exactly the scenario in which we *want* the looping behavior.
If that's happening, then it means it's taking us exactly as long to
find 5 buffers as it takes the rest of the system to use 5 buffers.
We need to run continuously to keep up.

 It's not.  But if they are in the same cache line, they will behave
 almost like one lock, because the CPU will lock the entire cache line
 for each atomic op.  See Tom's comments upthread.

  I think to avoid having them in the same cache line, we might need to
 add some padding (at least 72 bytes) as the structure size including both
 the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
 I have taken performance data as well by keeping them further apart
 as suggested by you upthread and by introducing padding, but the
 difference in performance is less than 1.5% (on 64 and 128 client count)
 which also might be due to variation of data across runs.  So now to
 proceed we have below options:

 a. use two spinlocks as in patch, but keep them as far apart as possible.
  This might not have an advantage as compared to what is used currently
  in the patch, but in future we can add padding to take advantage if
 possible (currently on PPC64, it doesn't show any noticeable advantage,
 however on some other m/c, it might show the advantage).

  b. use only one spinlock; this can have a disadvantage in certain cases
 as mentioned upthread, however those might not be usual cases, so for
 now we can consider them as lower priority and can choose this option.

I guess I don't care that much.  I only mentioned it because Tom
brought it up; I don't really see a big problem with the way you're
doing it.

 Another point in this regard is that I have to make use of volatile
 pointer to prevent code rearrangement in this case.

Yep.  Or we need to get off our duff and fix it so that's not necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 It's not difficult to handle such cases, but it can also have a downside
 for the cases where demand from backends is not high.
 Consider in the above case, if instead of 500 more allocations it just
 does 5 more allocations, then bgreclaimer will again have to go through
 the list and move 5 buffers, and the same can happen again by the time
 it moves those 5 buffers.

 That's exactly the scenario in which we *want* the looping behavior.
 If that's happening, then it means it's taking us exactly as long to
 find 5 buffers as it takes the rest of the system to use 5 buffers.
 We need to run continuously to keep up.

That's what I was thinking, as long as there isn't a lot of
overhead to starting and finishing a cycle.  If there is, my
inclination would be to try to fix that rather than to sleep and
hope things don't get out of hand before it wakes up again.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Alvaro Herrera
Robert Haas wrote:
 On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com wrote:
  +Background Reclaimer's Processing
  +-
 
  I suggest titling this section Background Reclaim.
 
  I don't mind changing it, but currently used title is based on similar
  title Background Writer's Processing.  It is used in previous
  paragraph.  Is there a reason to title this differently?
 
 Oh, I didn't see that.  Seems like weird phrasing to me, but I guess
 it's probably better to keep it consistent.

... or you can also change the other one.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-04 Thread Mark Kirkwood

On 04/09/14 14:42, Amit Kapila wrote:

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz
wrote:



Hi Amit,

Results look pretty good. Does it help in the read-write case too?


Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.



FWIW below are some test results on the 60 core beast with this patch 
applied to 9.4. I'll need to do more runs to iron out the variation, but 
it looks like the patch helps the standard (write heavy) pgbench 
workload a little, and clearly helps the read only case.



4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04  (Linux 3.13)
Postgres 9.4 beta2
+ buffer eviction patch v5

Pgbench

scale 2000

Non default params:

max_connections = 400;
shared_buffers = 10GB;
maintenance_work_mem = 1GB;
effective_io_concurrency = 10;
wal_buffers = 256MB;
checkpoint_segments = 1920;
checkpoint_completion_target = 0.8;
ssl = 'off';
wal_sync_method = 'open_datasync';

read write

elapsed 600s

Clients  | tps   | tps (unpatched)
-+---+
  6  |  8279 |  8328
  12 | 16260 | 16381
  24 | 23639 | 23451
  48 | 31430 | 31004
  96 | 38516 | 34777
 192 | 33535 | 32443
 384 | 27978 | 25068
 384 | 30589 | 28798


read only

elapsed 300s

Clients  | tps| tps (unpatched)
-++
  6  |  57654 |  57255
  12 | 111361 | 112360
  24 | 220304 | 187967
  48 | 384567 | 230961
  96 | 380309 | 241947
 192 | 330865 | 214570
 384 | 315516 | 207548


Regards

Mark




Re: [HACKERS] Scaling shared buffer eviction

2014-09-03 Thread Amit Kapila
On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  I have updated the patch to address the feedback.  Main changes are:
 
  1. For populating freelist, have a separate process (bgreclaimer)
  instead of doing it by bgwriter.
  2. Autotune the low and high threshold values for buffers
  in freelist. I have used the formula as suggested by you upthread.
  3. Cleanup of locking regimen as discussed upthread (completely
  eliminated BufFreelist Lock).
  4. Improved comments and general code cleanup.

 +Background Reclaimer's Processing
 +-

 I suggest titling this section Background Reclaim.

I don't mind changing it, but the current title is based on the similar
title Background Writer's Processing, which is used in the previous
paragraph.  Is there a reason to title this one differently?

 +The background reclaimer is designed to move buffers to freelist that are

 I suggest replacing the first three words of this sentence with
bgreclaimer.

Again, what I have used matches the BgWriter's explanation.  I thought
it would be better if the wording were similar.


 +while (tmp_num_to_free > 0)

 I am not sure it's a good idea for this value to be fixed at loop
 start and then just decremented.

It is based on the idea of what bgwriter does for num_to_scan, and
calling it once has the advantage that we need to take freelist_lck
just once.
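
To make that concrete, below is a stripped-down model of the pattern:
take the lock once to size the batch, then work from a local counter.
The struct and field names here are invented for illustration and are
not the actual freelist.c code.

#include <pthread.h>

typedef struct
{
    pthread_spinlock_t freelist_lck;    /* protects the two counters below */
    int         numFreeListBuffers;     /* current length of the freelist */
    int         highWatermark;          /* refill target */
} StrategyModel;

static void
reclaim_batch(StrategyModel *s)
{
    int         tmp_num_to_free;

    /* take freelist_lck just once to size the whole batch */
    pthread_spin_lock(&s->freelist_lck);
    tmp_num_to_free = s->highWatermark - s->numFreeListBuffers;
    pthread_spin_unlock(&s->freelist_lck);

    while (tmp_num_to_free > 0)
    {
        /*
         * Clock-sweep one victim buffer here, then link it onto the
         * freelist, re-taking freelist_lck only for the list manipulation.
         */
        pthread_spin_lock(&s->freelist_lck);
        s->numFreeListBuffers++;
        pthread_spin_unlock(&s->freelist_lck);

        tmp_num_to_free--;
    }
}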

 Shouldn't we loop and do the whole
 thing over once we reach the high watermark, only stopping when
 StrategySyncStartAndEnd() says num_to_free is 0?

Do you mean to say that for the high water mark check, we should
always refer to StrategySyncStartAndEnd() rather than getting the
value at the beginning?

Are you thinking that somebody else might have already put some
buffers onto the freelist, which would change the initially identified
high water mark?  I think that can happen only during a few rarely used
operations.  Do you think that, to handle it, we should start referring
to StrategySyncStartAndEnd() in the loop?

 In freelist.c, it seems like a poor idea to have two spinlocks as
 consecutive structure members; they'll be in the same cache line,
 leading to false sharing.  If we merge them into a single spinlock,
 does that hurt performance?

I have kept them separate so that backends searching for a buffer in
the freelist don't contend with bgreclaimer (while it is doing the clock
sweep) or with the clock sweep being done by other backends.  I think it
would be a bit tricky to devise a test where this hurts; however, it
doesn't seem too bad to have two separate locks in this case.

 If we put them further apart, e.g. by
 moving the freelist_lck to the start of the structure, followed by the
 latches, and leaving victimbuf_lck where it is, does that help
 performance?

I can investigate.

 +/*
 + * If the buffer is pinned or has a nonzero usage_count, we cannot use
 + * it; discard it and retry.  (This can only happen if VACUUM put a
 + * valid buffer in the freelist and then someone else used it before
 + * we got to it.  It's probably impossible altogether as of 8.3, but
 + * we'd better check anyway.)
 + */
 +

 This comment is clearly obsolete.

Okay, but this patch hasn't changed anything w.r.t. the above comment,
so I haven't changed it.  Do you want me to remove the second part of
the comment, starting with (This can only happen?
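
For context, the situation the comment guards against looks roughly like
the toy model below (not the actual StrategyGetBuffer() code): a buffer
popped off the freelist may have been reused since it was put there, so
it is skipped if it is pinned or has a nonzero usage count.

typedef struct BufModel
{
    int         refcount;       /* pin count */
    int         usage_count;    /* clock-sweep usage counter */
    struct BufModel *freeNext;  /* next entry on the freelist */
} BufModel;

/* Pop freelist entries until one is found that is still unused. */
static BufModel *
pop_usable_buffer(BufModel **freelist_head)
{
    while (*freelist_head != NULL)
    {
        BufModel   *buf = *freelist_head;

        *freelist_head = buf->freeNext;
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;
        /* pinned or recently used: discard it and retry */
    }
    return NULL;                /* empty; caller falls back to clock sweep */
}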


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-03 Thread Robert Haas
On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 +Background Reclaimer's Processing
 +-

 I suggest titling this section Background Reclaim.

 I don't mind changing it, but the current title is based on the similar
 title Background Writer's Processing, which is used in the previous
 paragraph.  Is there a reason to title this one differently?

Oh, I didn't see that.  Seems like weird phrasing to me, but I guess
it's probably better to keep it consistent.

 +The background reclaimer is designed to move buffers to freelist that are

 I suggest replacing the first three words of this sentence with
 bgreclaimer.

 Again, what I have used matches the BgWriter's explanation.  I thought
 it would be better if the wording were similar.

OK.

 +while (tmp_num_to_free > 0)

 I am not sure it's a good idea for this value to be fixed at loop
 start and then just decremented.

 It is based on the idea of what bgwriter does for num_to_scan, and
 calling it once has the advantage that we need to take freelist_lck
 just once.

Right, we shouldn't call it every loop iteration.  However, consider
this scenario: there are no remaining buffers on the list and the high
watermark is 2000.  We add 2000 buffers to the list.  But by the time
we get done, other backends have already done 500 more allocations, so
now there are only 1500 buffers on the list.  If this should occur, we
should add an additional 500 buffers to the list before we consider
sleeping.  We want bgreclaimer to be able to run continuously if the
demand for buffers is high enough.
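
Roughly like the following, using the same toy StrategyModel struct as in
the sketch upthread (again illustrative only, not the actual patch code):
recompute the deficit after every batch and only sleep once it really is
zero.

static void
reclaim_until_full(StrategyModel *s)
{
    for (;;)
    {
        int         num_to_free;

        pthread_spin_lock(&s->freelist_lck);
        num_to_free = s->highWatermark - s->numFreeListBuffers;
        pthread_spin_unlock(&s->freelist_lck);

        if (num_to_free <= 0)
            break;              /* high watermark reached; safe to sleep */

        while (num_to_free-- > 0)
        {
            /* clock-sweep and push one buffer, as in the sketch upthread */
            pthread_spin_lock(&s->freelist_lck);
            s->numFreeListBuffers++;
            pthread_spin_unlock(&s->freelist_lck);
        }

        /*
         * Concurrent allocations may already have drained part of what was
         * just added, so loop around and recheck before going to sleep.
         */
    }
}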

 In freelist.c, it seems like a poor idea to have two spinlocks as
 consecutive structure members; they'll be in the same cache line,
 leading to false sharing.  If we merge them into a single spinlock,
 does that hurt performance?

 I have kept them separate so that backends searching for a buffer in
 the freelist don't contend with bgreclaimer (while it is doing the clock
 sweep) or with the clock sweep being done by other backends.  I think it
 would be a bit tricky to devise a test where this hurts; however, it
 doesn't seem too bad to have two separate locks in this case.

It's not.  But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op.  See Tom's comments upthread.
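
To illustrate, a layout along these lines keeps them apart (field names
are taken from this discussion, the padding assumes 64-byte cache lines,
and this is only a model, not the patch code); moving freelist_lck to the
front of the struct with the latches in between achieves much the same
thing:

#include <pthread.h>

#define CACHE_LINE_SIZE 64

typedef struct
{
    pthread_spinlock_t freelist_lck;    /* guards the freelist head/tail */
    char        pad[CACHE_LINE_SIZE];   /* push victimbuf_lck onto a
                                         * different cache line */
    pthread_spinlock_t victimbuf_lck;   /* guards the clock-sweep hand */
} StrategyLocksModel;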

 Okay, but this patch hasn't changed anything w.r.t. the above comment,
 so I haven't changed it.  Do you want me to remove the second part of
 the comment, starting with (This can only happen?

Right.  Clearly it can happen again once we have this patch: that's
the entire point of the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-03 Thread Mark Kirkwood

On 03/09/14 16:22, Amit Kapila wrote:

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote:


On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com
wrote:


I have yet to collect data under varying loads; however, I have
collected performance data for 8GB shared buffers, which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various
loads, which I will share in the next few days; however, I think the
patch is ready for the next round of review, so I will mark it as Needs
Review.


I have collected more data with the patch.  I understand that you
have given more review comments due to which the patch requires
changes; however, I think they will not affect the performance data
to a great extent, and since I have taken the data anyway, I am
sharing it.



Performance Data:
---

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout=15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load


Common configuration remains same as above.


Forgot to mention that the data is the median of 3 runs; the attached
sheet contains the data for the individual runs.




Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-09-03 Thread Amit Kapila
On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz
wrote:


 Hi Amit,

 Results look pretty good. Does it help in the read-write case too?

The last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there didn't seem to be any major
gain for that; however, for cases where reads are predominant, you
might see better gains.

I am planning to take that data again in the next few days.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

