Re: [HACKERS] Scaling shared buffer eviction

2014-09-02 Thread Robert Haas
On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 I have updated the patch to address the feedback.  The main changes are:

 1. For populating the freelist, have a separate process (bgreclaimer)
 instead of doing it in bgwriter.
 2. Autotune the low and high threshold values for buffers
 in the freelist.  I have used the formula suggested by you upthread.
 3. Cleanup of the locking regimen as discussed upthread (completely
 eliminated BufFreelistLock).
 4. Improved comments and general code cleanup.

Overall this looks quite promising to me.

I had thought to call the new process just bgreclaim rather than
bgreclaimer, but perhaps your name is better after all.  At least,
it matches what we do elsewhere.  But I don't care for the use of
Bgreclaimer; let's do BgReclaimer if we really need mixed-case, or
else bgreclaimer.

This is unclear:

+buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something
like this: The data structures relating to buffer eviction are
protected by two spinlocks.  freelist_lck protects the freelist and
related data structures, while victimbuf_lck protects information
related to the current clock sweep position.
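
For illustration, the split might look like this in code; a sketch only,
using the real SpinLockAcquire/SpinLockRelease primitives but with the
surrounding details simplified:

    /* Freelist manipulation needs only freelist_lck ... */
    SpinLockAcquire(&StrategyControl->freelist_lck);
    bufid = StrategyControl->firstFreeBuffer;
    if (bufid >= 0)
        StrategyControl->firstFreeBuffer = BufferDescriptors[bufid].freeNext;
    SpinLockRelease(&StrategyControl->freelist_lck);

    /* ... while advancing the clock hand needs only victimbuf_lck. */
    SpinLockAcquire(&StrategyControl->victimbuf_lck);
    victim = StrategyControl->nextVictimBuffer;
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
        StrategyControl->nextVictimBuffer = 0;
    SpinLockRelease(&StrategyControl->victimbuf_lck);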

+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer.  The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than
running the clock sweep algorithm, which may encounter many buffers
that are poor candidates for eviction before finding a good candidate.
Therefore, we have a background process called bgreclaimer which works
to keep this list populated.
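
To sketch the backend-side consequence (the helper names here are
assumptions, not the patch's actual functions): allocation prefers the
freelist, wakes bgreclaimer when the list runs low, and falls back to the
clock sweep only when the list is empty.

    buf = PopFreeListHead();                  /* cheap; under freelist_lck */

    if (FreeListLength() < freelist_low_water_mark)
        SetLatch(StrategyControl->bgreclaimerLatch);  /* request a refill */

    if (buf == NULL)
        buf = RunClockSweep();                /* expensive fallback path */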

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section Background Reclaim.

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with bgreclaimer.

+and move the the unpinned and zero usage count buffers to freelist.  It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the water
mark terminology.

+if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+/* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented.  Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?
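
In code form, the loop shape being suggested is roughly the following;
a sketch only, with ClockSweepNextBuffer() and the exact
StrategySyncStartAndEnd() interface as assumed names:

    for (;;)
    {
        int         num_to_free;

        /* Re-check how far below the high water mark the freelist is. */
        num_to_free = StrategySyncStartAndEnd();
        if (num_to_free <= 0)
            break;              /* freelist is at the high water mark */

        while (num_to_free > 0)
        {
            volatile BufferDesc *buf = ClockSweepNextBuffer();

            /* Only unpinned, zero-usage-count buffers are moved. */
            if (buf->refcount == 0 && buf->usage_count == 0 &&
                StrategyMoveBufferToFreeListEnd(buf))
                num_to_free--;
        }
    }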

+/* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing.  If we merge them into a single spinlock,
does that hurt performance?  If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?
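
Concretely, the second arrangement could look like the sketch below
(field names other than the two spinlocks are illustrative, and the
intervening members are what keep the locks on different cache lines):

    typedef struct BufferStrategyControl
    {
        slock_t     freelist_lck;    /* protects the freelist fields */
        int         firstFreeBuffer; /* head of freelist, or < 0 if empty */
        int         lastFreeBuffer;  /* tail of freelist */

        Latch      *bgwriterLatch;   /* latches grouped after freelist_lck */
        Latch      *bgreclaimerLatch;

        /* ... other members, providing physical separation ... */

        slock_t     victimbuf_lck;   /* protects clock-sweep state */
        int         nextVictimBuffer;
        uint32      completePasses;
    } BufferStrategyControl;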

+/*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry.  (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it.  It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+

This comment is clearly obsolete.

 I have not yet added statistics (buffers_backend_clocksweep), as
 for that we need to add one more variable to the BufferStrategyControl
 structure, where I have already added a few variables for this patch.
 I think it is important to have such a stat available via
 pg_stat_bgwriter, but I am not sure if it is worth making the structure
 a bit more bulky.

I think it's worth it.

 Another minor point is about changes in lwlock.h:
 lwlock.h
 * if you remove a lock, consider leaving a gap in the numbering
 * sequence for the benefit of DTrace and other external debugging
 * scripts.

 As I have removed the BufFreelist lock, I have adjusted the numbering
 in lwlock.h as well.  There is a message on top of the lock definitions
 which suggests leaving a gap if we remove any 

Re: [HACKERS] Scaling shared buffer eviction

2014-09-02 Thread Amit Kapila
On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 I have yet to collect data under varying loads; however, I have
 collected performance data for 8GB shared buffers which shows
 reasonably good performance and scalability.

 I think the main part left for this patch is more data for various loads,
 which I will share in the next few days.  However, I think the patch is
 ready for the next round of review, so I will mark it as Needs Review.

I have collected more data with the patch.  I understand that you
have given more review comments due to which the patch requires
changes; however, I think they will not affect the performance data
to a great extent, and since I have already taken the data, I am
sharing it here.

 Performance Data:
 ---

 Configuration and Db Details
 IBM POWER-7 16 cores, 64 hardware threads
 RAM = 64GB
 Database Locale =C
 checkpoint_segments=256
 checkpoint_timeout=15min
 scale factor = 3000
 Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
 Duration of each individual run = 5mins

 All the data is in tps and taken using pgbench read-only load

The common configuration remains the same as above.

Shared_Buffers = 500MB

Client Count    8       16      32      64      128
HEAD            56248   100112  121341  81128   56552
Patch           59389   112483  157034  185740  166725

Shared_Buffers = 1GB

Client Count    8       16      32      64      128
HEAD            56401   102557  121643  81686   57091
Patch           59361   114813  157528  188209  167752

Shared_Buffers = 14GB

Client Count    8       16      32      64      128
HEAD            60059   110582  152051  130718  97014
Patch           61462   117377  169767  225803  229083

Shared_Buffers = 15GB

Client Count    8       16      32      64      128
HEAD            60005   112928  153650  135203  36343
Patch           61345   115569  168767  226150  36985


Observations
-
1. The performance improvement is up to 2~3 times for higher client
counts (64, 128).
2. For a lower client count (8), we see a 2~5% performance
improvement.
3. Overall, this improves read scalability.
4. For lower numbers of shared buffers, we see that there is a minor
dip in tps even with the patch (we might be able to improve it by
tuning the high water mark for the number of buffers on the freelist;
I will try this by varying the high water mark).
5. For larger shared buffers (15GB), we can see that there is still a
dip at large client counts, although the situation is not as bad as on
HEAD.  The reason is that at such high shared buffers and client counts,
I/O starts happening because all the data can't be contained in RAM.

I will try to take some data for a TPC-B load as well.  Kindly let me know
if you want to see data for some other configuration.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-09-02 Thread Amit Kapila
On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila amit.kapil...@gmail.com wrote:

 On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
 
  I have yet to collect data under varying loads; however, I have
  collected performance data for 8GB shared buffers which shows
  reasonably good performance and scalability.

  I think the main part left for this patch is more data for various loads,
  which I will share in the next few days.  However, I think the patch is
  ready for the next round of review, so I will mark it as Needs Review.

 I have collected more data with the patch.  I understand that you
 have given more review comments due to which the patch requires
 changes; however, I think they will not affect the performance data
 to a great extent, and since I have already taken the data, I am
 sharing it here.


  Performance Data:
  ---
 
  Configuration and Db Details
  IBM POWER-7 16 cores, 64 hardware threads
  RAM = 64GB
  Database Locale =C
  checkpoint_segments=256
  checkpoint_timeout=15min
  scale factor = 3000
  Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
  Duration of each individual run = 5mins
 
  All the data is in tps and taken using pgbench read-only load

 The common configuration remains the same as above.

Forgot to mention that the data is a median of 3 runs; the attached
sheet contains data for the individual runs.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


perf_read_scalability_data_v5.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
 On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you should get rid of BufFreelistLock completely and just
 decide that freelist_lck will protect everything: the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.

 Another point is I think it will be better to protect
 StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck, and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

 I'm rather concerned by this cavalier assumption that we can protect
 fields a,b,c with one lock and fields x,y,z in the same struct with some
 other lock.

 A minimum requirement for that to work safely at all is that the fields
 are of atomically fetchable/storable widths; which might be okay here
 but it's a restriction that bears thinking about (and documenting).

 But quite aside from safety, the fields are almost certainly going to
 be in the same cache line which means contention between processes that
 are trying to fetch or store them concurrently.  For a patch whose sole
 excuse for existence is to improve performance, that should be a very
 scary concern.

 (And yes, I realize these issues already affect the freelist.  Perhaps
 that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mind while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar.  Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing.  I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that.  If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-08-27 Thread Robert Haas
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Today, while working on updating the patch to improve locking,
 I found that since we are now going to have a new process, we need
 a separate latch in StrategyControl to wake up that process.
 Another point is I think it will be better to protect
 StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck, and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable.  I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer.  I don't see much point
in fine-tuning the locking regimen until that's done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Amit Kapila
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:

 Incidentally, while I generally think your changes to the locking regimen
 in StrategyGetBuffer() are going in the right direction, they need
 significant cleanup.  Your patch adds two new spinlocks, freelist_lck and
 victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
 you've now got StrategyGetBuffer() running with no lock at all when
 accessing some things that used to be protected by BufFreelistLock;
 specifically, you're doing StrategyControl->numBufferAllocs++ and
 SetLatch(StrategyControl->bgwriterLatch) without any locking.  That's not
 OK.  I think you should get rid of BufFreelistLock completely and just
 decide that freelist_lck will protect everything: the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.

 Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
 the LWLock is acquired today.  Increment StrategyControl->numBufferAllocs;
 save the value of StrategyControl->bgwriterLatch; pop a buffer off the
 freelist if there is one, saving its identity.  Release the spinlock.
 Then, set the bgwriterLatch if needed.  In the first loop, first check
 whether the buffer we previously popped from the freelist is pinned or has
 a non-zero usage count and return it if not, holding the buffer header
 lock.  Otherwise, reacquire the spinlock just long enough to pop a new
 potential victim and then loop around.
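
In code form, the protocol described above is roughly the sketch below
(PopFreeListHead() and StrategyGetBufferSketch() are assumed names, not
the patch's actual interfaces):

    static volatile BufferDesc *
    StrategyGetBufferSketch(void)
    {
        Latch      *bgwriterLatch;
        int         bufid;

        SpinLockAcquire(&StrategyControl->freelist_lck);
        StrategyControl->numBufferAllocs++;
        bgwriterLatch = StrategyControl->bgwriterLatch; /* saved under lock */
        bufid = PopFreeListHead();                      /* -1 if empty */
        SpinLockRelease(&StrategyControl->freelist_lck);

        if (bgwriterLatch)
            SetLatch(bgwriterLatch);    /* done outside the spinlock */

        while (bufid >= 0)
        {
            volatile BufferDesc *buf = &BufferDescriptors[bufid];

            LockBufHdr(buf);            /* buffer header lock */
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;             /* still a good victim */
            UnlockBufHdr(buf);

            /* Reacquire just long enough to pop another candidate. */
            SpinLockAcquire(&StrategyControl->freelist_lck);
            bufid = PopFreeListHead();
            SpinLockRelease(&StrategyControl->freelist_lck);
        }

        return NULL;    /* freelist exhausted: caller runs the clock sweep */
    }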


Today, while working on updating the patch to improve locking,
I found that since we are now going to have a new process, we need
a separate latch in StrategyControl to wake up that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck, and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I thought it better to mention the above points so that if you have
any different thoughts about them, we can discuss them now
rather than after I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Tom Lane
Amit Kapila amit.kapil...@gmail.com writes:
 On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
 I think you should get rid of BufFreelistLock completely and just
 decide that freelist_lck will protect everything: the freeNext links, plus
 everything in StrategyControl except for nextVictimBuffer.  victimbuf_lck
 will protect nextVictimBuffer and nothing else.

 Another point is I think it will be better to protect
 StrategyControl->completePasses with victimbuf_lck rather than
 freelist_lck, as when we are going to update it we will already be
 holding the victimbuf_lck, and it doesn't make much sense to release
 the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently.  For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.

(And yes, I realize these issues already affect the freelist.  Perhaps
that's part of the reason we have performance issues with it.)

regards, tom lane




Re: [HACKERS] Scaling shared buffer eviction

2014-08-26 Thread Amit Kapila
On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Amit Kapila amit.kapil...@gmail.com writes:
  Another point is I think it will be better to protect
  StrategyControl->completePasses with victimbuf_lck rather than
  freelist_lck, as when we are going to update it we will already be
  holding the victimbuf_lck, and it doesn't make much sense to release
  the victimbuf_lck and reacquire freelist_lck to update it.

 I'm rather concerned by this cavalier assumption that we can protect
 fields a,b,c with one lock and fields x,y,z in the same struct with some
 other lock.

In some cases, it could be beneficial, especially when a,b,c are
going to be accessed more frequently than x,y,z.

 A minimum requirement for that to work safely at all is that the fields
 are of atomically fetchable/storable widths; which might be okay here
 but it's a restriction that bears thinking about (and documenting).

 But quite aside from safety, the fields are almost certainly going to
 be in the same cache line which means contention between processes that
 are trying to fetch or store them concurrently.

I think the patch will reduce the contention for some of those variables
(the ones accessed during the clock sweep), as it minimizes the need
for backends to perform the clock sweep.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-08-13 Thread Andres Freund
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
 Overall, the main changes required in patch as per above feedback
 are:
 1. add an additional counter for the number of those
 allocations not satisfied from the free list, with a
 name like buffers_alloc_clocksweep.
 2. Autotune the low and high threshold values for buffers
 in freelist. In the patch, I have kept them as hard-coded
 values.
 3. For populating freelist, have a separate process (bgreclaim)
 instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go, to be honest.  Seems
like a huge band-aid to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-08-13 Thread Amit Kapila
On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund and...@2ndquadrant.com
wrote:

 On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
  Overall, the main changes required in patch as per above feedback
  are:
  1. add an additional counter for the number of those
  allocations not satisfied from the free list, with a
  name like buffers_alloc_clocksweep.
  2. Autotune the low and high threshold values for buffers
  in freelist. In the patch, I have kept them as hard-coded
  values.
  3. For populating freelist, have a separate process (bgreclaim)
  instead of doing it by bgwriter.

 I'm not convinced that 3) is the right way to go to be honest. Seems
 like a huge bandaid to me.

Doing both (populating the freelist and flushing dirty buffers) via
bgwriter isn't the best way either, because it might not be able to
perform both jobs as needed.
One example: it could take much longer to flush a dirty buffer
than to move one to the freelist, so if there are a few buffers which
we need to flush, then the task of maintaining buffers in the freelist
will suffer even though there are buffers in the list which could be
moved to the freelist (non-dirty buffers).
Another is maintaining the current behaviour of bgwriter, which is to
scan the entire buffer pool every few minutes (assuming the default
configuration).  We can attempt to solve this problem as suggested by
Robert upthread, but I am not completely sure that can guarantee the
current behaviour will be retained as-is.

I am not saying that having a separate process won't have any issues,
but I think we can tackle them without changing or complicating the
current bgwriter behaviour.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-08-12 Thread Andres Freund
On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
 On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  This essentially removes BgWriterDelay, but it's still mentioned in
 BgBufferSync().  Looking further, I see that with the patch applied,
 BgBufferSync() is still present in the source code but is no longer called
 from anywhere.  Please don't submit patches that render things unused
 without actually removing them; it makes it much harder to see what you've
 changed.  I realize you probably left it that way for testing purposes, but
 you need to clean such things up before submitting.  Likewise, if you've
 rendered GUCs or statistics counters removed, you need to rip them out, so
 that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email almost unreadable because it is missing
quoting signs after linebreaks in quoted content.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-08-12 Thread Amit Kapila
On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund and...@2ndquadrant.com
wrote:
 On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
  On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com
wrote:
   This essentially removes BgWriterDelay, but it's still mentioned in
  BgBufferSync().  Looking further, I see that with the patch applied,
  BgBufferSync() is still present in the source code but is no longer
called
  from anywhere.  Please don't submit patches that render things unused
  without actually removing them; it makes it much harder to see what
you've
  changed.  I realize you probably left it that way for testing purposes,
but
  you need to clean such things up before submitting.  Likewise, if you've
  rendered GUCs or statistics counters removed, you need to rip them out,
so
  that the scope of the changes you've made is clear to reviewers.

 FWIW, I found this email almost unreadable because it is missing
 quoting signs after linebreaks in quoted content.

I think I did something wrong while replying to Robert's
mail.  The main point in that mail was trying to see if there is any
major problem in case we have a separate process (bgreclaim) to
populate the freelist.  One thing which I thought could be problematic
is putting a buffer on the freelist which has a usage_count of zero but
is *dirty*.  Please do let me know if you want clarification on
something in particular.


Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

There are other things as well which I need to take care of per the
feedback, like some changes in the locking strategy and code.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-08-06 Thread Amit Kapila
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
 This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync().  Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere.  Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed.  I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting.  Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

I have kept it just for the reason that if the basic approach
sounds reasonable/is accepted, then I will clean it up.  Sorry for
the inconvenience; I didn't realize that it can be annoying for
a reviewer.  I will remove all such code from the patch in the next version.


 A comparison of BgBufferSync() with
BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least
one behavior that some people (at least, me) will care about, which is the
guarantee that the background writer will scan the entire buffer pool at
least every couple of minutes.

Okay, I will take care of this based on the conclusion of
the other points in this mail.

This is important because it guarantees that dirty data doesn't sit in
memory forever.  When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk.  This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

 b.  New stats for the number of buffers on the freelist have been added;
  some old ones like maxwritten_clean can be removed, as the new logic
  for syncing buffers and moving them to the free list doesn't use them.
  However, I think it's better to remove them once the new logic is
  accepted.  Added some new logs for info related to the free list under
  BGW_DEBUG.


 If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist.  That makes no sense.

I think using the 'number of buffers currently on the freelist' and
the 'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need the clock sweep, assuming the low and high water
marks are fixed.  However, there can be cases where it is not
easy to estimate that number.

 I think what you should be counting is the number of allocations that are
being satisfied from the free-list.  Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run.  Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep.  We can then try to tune the patch to make
that number as small as possible under varying workloads.

This can give us a clear idea of how to tune the patch; however, we need
to maintain 3 counters for it in the code (recent_alloc (needed for
the current bgwriter logic) and the other 2 suggested by you).  Do you
want to retain such counters in the code, or are they a kind of debug
info for the patch?



 d.  Autotune the low and high threshold for freelist for various
  configurations.

 I think we need to come up with some kind of formula here rather than
just a list of hard-coded constants.

That was my initial intention as well, and I have tried basing it
on the number of shared buffers, like keeping the threshold values as a
percentage of shared buffers, but nothing could satisfy different
kinds of workloads.  The current values I have chosen are based
on experiments for various workloads at different thresholds.  I have
shown the lwlock_stats data for various loads based on the current
thresholds upthread.  Another way could be to make them config
knobs and use the values given by the user when provided,
else go with fixed values.

There are other instances in the code as well (one of them I remember
offhand is in pglz_compress) where we use fixed values based on
different sizes.

And it definitely needs some comments explaining the logic behind the
choices.

Agreed, I shall improve them in the next version of the patch.


 Aside from those specific remarks, I think the elephant in the room is
the question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk.  One problem, which I alluded to above under point (1), is 

Re: [HACKERS] Scaling shared buffer eviction

2014-08-06 Thread Robert Haas
On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 If I'm reading this right, the new statistic is an incrementing counter
 where, every time you update it, you add the number of buffers currently on
 the freelist.  That makes no sense.

 I think using the 'number of buffers currently on the freelist' and
 the 'number of recently allocated buffers' for consecutive cycles,
 we can figure out approximately how many buffer allocations
 need the clock sweep, assuming the low and high water
 marks are fixed.  However, there can be cases where it is not
 easy to estimate that number.

Counters should be designed in such a way that you can read one, and
then read it again later, and make sense of it - you should not need to
read the counter on *consecutive* cycles to interpret it.

 I think what you should be counting is the number of allocations that are
 being satisfied from the free-list.  Then, by comparing the rate at which
 that value is incrementing to the rate at which buffers_alloc is
 incrementing, somebody can figure out what percentage of allocations are
 requiring a clock-sweep run.  Actually, I think it's better to flip it
 around: count the number of allocations that require an individual backend
 to run the clock sweep (vs. being satisfied from the free-list); call it,
 say, buffers_backend_clocksweep.  We can then try to tune the patch to make
 that number as small as possible under varying workloads.

 This can give us a clear idea of how to tune the patch; however, we need
 to maintain 3 counters for it in the code (recent_alloc (needed for
 the current bgwriter logic) and the other 2 suggested by you).  Do you
 want to retain such counters in the code, or are they a kind of debug
 info for the patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch.  We already have a counter of total buffer
allocations; that's buffers_alloc.  I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations).  I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.
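
A sketch of where such a counter would be bumped (field and helper names
here are assumptions; only buffers_alloc exists today):

    bufid = PopFreeListHead();          /* under freelist_lck */
    if (bufid < 0)
    {
        /* allocation not satisfied from the freelist */
        StrategyControl->numClocksweepAllocs++; /* buffers_alloc_clocksweep */
        bufid = RunClockSweep();
    }

A monitoring tool can then compute buffers_alloc_clocksweep /
buffers_alloc from pg_stat_bgwriter, and tuning should drive that ratio
toward zero.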

 d.  Autotune the low and high threshold for freelist for various
  configurations.

 I think we need to come up with some kind of formula here rather than just
 a list of hard-coded constants.

 That was my initial intention as well, and I have tried basing it
 on the number of shared buffers, like keeping the threshold values as a
 percentage of shared buffers, but nothing could satisfy different
 kinds of workloads.  The current values I have chosen are based
 on experiments for various workloads at different thresholds.  I have
 shown the lwlock_stats data for various loads based on the current
 thresholds upthread.  Another way could be to make them config
 knobs and use the values given by the user when provided,
 else go with fixed values.

How did you go about determining the optimal value for a particular workload?

When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled.  That's clearly good.  But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep.  My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.

I think your list of high-watermarks is far too generous for low
buffer counts.  With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high.  At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous.  The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5.  Let the low-watermark be 20% of the
high-watermark.  That might not be best, but I think some kind of
formula like that can likely be made to work.  I would 
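
In code form, the suggested formula might look like this (a sketch; the
function name and clamping order are assumptions):

    static void
    ComputeFreelistWaterMarks(int nbuffers, int *high_wm, int *low_wm)
    {
        /* High water mark: 0.5% of shared buffers, clamped to [5, 2000]. */
        *high_wm = nbuffers / 200;
        if (*high_wm > 2000)
            *high_wm = 2000;
        if (*high_wm < 5)
            *high_wm = 5;

        /* Low water mark: 20% of the high water mark. */
        *low_wm = *high_wm / 5;
    }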

Re: [HACKERS] Scaling shared buffer eviction

2014-08-05 Thread Robert Haas
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila amit.kapil...@gmail.com wrote:

 I have improved the patch by making the following changes:
 a. Improved the bgwriter logic to log xl_running_xacts info and
 removed the hibernate logic, as bgwriter will now work only when
 there is a scarcity of buffers in the free list. The basic idea is that
 when the number of buffers on the freelist drops below the low
 threshold, the allocating backend sets the latch, and bgwriter wakes up
 and begins adding buffers to the freelist until it reaches the high
 threshold, and then goes back to sleep.


This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync().  Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere.  Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed.  I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting.  Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist()
reveals that you've removed at least one behavior that some people (at
least, me) will care about, which is the guarantee that the background
writer will scan the entire buffer pool at least every couple of minutes.
This is important because it guarantees that dirty data doesn't sit in
memory forever.  When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk.  This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

b.  New stats for the number of buffers on the freelist have been added;
  some old ones like maxwritten_clean can be removed, as the new logic
  for syncing buffers and moving them to the free list doesn't use them.
  However, I think it's better to remove them once the new logic is
  accepted.  Added some new logs for info related to the free list under
  BGW_DEBUG.


If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist.  That makes no sense.  I think what you should be counting is
the number of allocations that are being satisfied from the free-list.
Then, by comparing the rate at which that value is incrementing to the rate
at which buffers_alloc is incrementing, somebody can figure out what
percentage of allocations are requiring a clock-sweep run.  Actually, I
think it's better to flip it around: count the number of allocations that
require an individual backend to run the clock sweep (vs. being satisfied
from the free-list); call it, say, buffers_backend_clocksweep.  We can then
try to tune the patch to make that number as small as possible under
varying workloads.

c.  Used the already existing bgwriterLatch in BufferStrategyControl to
  wake bgwriter when number of buffer's in freelist drops below
  threshold.


Seems like a good idea.


 d.  Autotune the low and high thresholds for the freelist for various
  configurations.  Generally, if we keep a small number (200~2000) of
  buffers always available on the freelist, then even for high shared
  buffers like 15GB, it appears to be sufficient.  However, when the
  value of shared buffers is smaller, we need a much smaller number.  I
  think we can provide these as config knobs for the user as well, but
  for now, based on LWLOCK_STATS results, I have chosen some hard-coded
  values for the low and high thresholds for the freelist.
  Values for the low and high thresholds have been decided based on the
  total number of shared buffers; basically, I have divided them into 5
  categories (16~100, 100~1000, 1000~1, 1~10,
  10 and above) and then ran tests (read-only pgbench) for various
  configurations falling under these categories.  The reason for keeping
  fewer categories for larger shared buffers is that if a small number
  (200~2000) of buffers is available on the free list, then it seems to
  be sufficient for quite high loads; however, as the total number of
  shared buffers decreases, we need to be more careful: if we keep the
  number too low, it will lead to more clock sweeps by backends (which
  means freelist lock contention), and if we keep the number higher,
  bgwriter will evict many useful buffers.  Results based on
  LWLOCK_STATS are at the end of this mail.


I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants.  And it definitely needs some comments
explaining the logic behind the choices.


Re: [HACKERS] Scaling shared buffer eviction

2014-06-25 Thread Amit Kapila
On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner kgri...@ymail.com wrote:
  Backend processes related to user connections still
  performed about 30% of the writes, and this work shows promise
  toward bringing that down, which would be great; but please don't
  eliminate the ability to prevent write stalls in the process.


 I am planning to take some more performance data, part of which will
 be write load as well, but I am not sure if that can anyway show the
 need as mentioned by you.

After taking the performance data for a write load using TPC-B with the
patch, I found that there is a regression.  So I went ahead and
tried to figure out the reason for it, and found that after the patch,
bgwriter started flushing buffers which were required by backends;
the reason was that *nextVictimBuffer* was not getting updated
properly while running the clock-sweep-like logic (decrementing
the usage count when the number of buffers on the freelist falls below
the low threshold value) in bgwriter.  In HEAD, I noticed that at default
settings, bgwriter was not flushing any buffers at all, which is at least
better than what my patch was doing (flushing buffers required by
backends).

So I tried to fix the issue by updating *nextVictimBuffer* in the new
bgwriter logic, and the results are positive.

sbe - scalable buffer eviction

Select-only Data

Client count/TPS    64      128
Un-patched          45232   17310
sbe_v3              111468  114521
sbe_v4              153137  160752

TPC-B

Client count/TPS    64      128
Un-patched          825     784
sbe_v4              814     845

For the select-only data, I am quite confident that it will improve if we
introduce nextVictimBuffer increments in bgwriter, and it scales much
better with that change; however, for TPC-B, I am getting fluctuation in
the data, so I am not sure it has eliminated the problem.  The main
difference is that in HEAD, bgwriter never increments nextVictimBuffer
while syncing the buffers; it just notes down the current setting before
starting and then proceeds sequentially.

I think it will be good if we can have a new process for moving buffers
to the freelist, for the reasons below:

a. While trying to move buffers to the freelist, it should not block on
intervening write activity.
b. The writer should not increment nextVictimBuffer, maintaining
the current logic.

One significant change in this version of the patch is to use a separate
spinlock to protect nextVictimBuffer rather than using BufFreelistLock.
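
A sketch of that change (assumed field names); note how completePasses
naturally falls under the same lock, as discussed elsewhere in the thread:

    SpinLockAcquire(&StrategyControl->victimbuf_lck);
    victim = StrategyControl->nextVictimBuffer;
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;  /* one full sweep completed */
    }
    SpinLockRelease(&StrategyControl->victimbuf_lck);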

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


scalable_buffer_eviction_v4.patch
Description: Binary data



Re: [HACKERS] Scaling shared buffer eviction

2014-06-10 Thread Robert Haas
On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner kgri...@ymail.com wrote:
 Amit Kapila amit.kapil...@gmail.com wrote:
 I have improved the patch by making the following changes:
 a. Improved the bgwriter logic to log xl_running_xacts info and
 removed the hibernate logic, as bgwriter will now work only when
 there is a scarcity of buffers in the free list. The basic idea is that
 when the number of buffers on the freelist drops below the low
 threshold, the allocating backend sets the latch, and bgwriter wakes up
 and begins adding buffers to the freelist until it reaches the high
 threshold, and then goes back to sleep.

 The numbers from your benchmarks are very exciting, but the above
 concerns me.  My tuning of the bgwriter in production has generally
 *not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.

 Essentially I was able to tune the bgwriter so that a dirty page
 was always pushed out to the OS cache within three seconds, which led
 to a healthy balance of writes between the checkpoint process and
 the bgwriter. Backend processes related to user connections still
 performed about 30% of the writes, and this work shows promise
 toward bringing that down, which would be great; but please don't
 eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs.  In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now.  If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK.  In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends.  I'm not sure what the right
answer is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Kevin Grittner
Amit Kapila amit.kapil...@gmail.com wrote:

 I have improved the patch by making the following changes:
 a. Improved the bgwriter logic to log xl_running_xacts info and
    removed the hibernate logic, as bgwriter will now work only when
    there is a scarcity of buffers in the free list. The basic idea is
    that when the number of buffers on the freelist drops below the low
    threshold, the allocating backend sets the latch, and bgwriter wakes
    up and begins adding buffers to the freelist until it reaches the
    high threshold, and then goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me.  My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls.  By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-06-08 Thread Amit Kapila
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner kgri...@ymail.com wrote:
 Amit Kapila amit.kapil...@gmail.com wrote:

  I have improved the patch by making the following changes:
  a. Improved the bgwriter logic to log xl_running_xacts info and
     removed the hibernate logic, as bgwriter will now work only when
     there is a scarcity of buffers in the free list. The basic idea is
     that when the number of buffers on the freelist drops below the low
     threshold, the allocating backend sets the latch, and bgwriter wakes
     up and begins adding buffers to the freelist until it reaches the
     high threshold, and then goes back to sleep.

 The numbers from your benchmarks are very exciting, but the above
 concerns me.  My tuning of the bgwriter in production has generally
 *not* been aimed at keeping pages on the freelist, but toward
 preventing shared_buffers from accumulating a lot of dirty pages,
 which were leading to cascades of writes between caches and thus to
 write stalls.  By pushing dirty pages into the (*much* larger) OS
 cache, and letting write combining happen there, where the OS could
 pace based on the total number of dirty pages instead of having
 some hidden and appearing rather suddenly, latency spikes were
 avoided while not causing any noticeable increase in the number of
 OS writes to the RAID controller's cache.

 Essentially I was able to tune the bgwriter so that a dirty page
 was always pushed out to the OS cache within three seconds, which led
 to a healthy balance of writes between the checkpoint process and
 the bgwriter.

I think it would have been better if bgwriter did its writes based on
the number of buffers that get dirtied, to achieve that balance of
writes.

 Backend processes related to user connections still
 performed about 30% of the writes, and this work shows promise
 toward bringing that down, which would be great; but please don't
 eliminate the ability to prevent write stalls in the process.

I agree that for some cases, as explained by you, the current bgwriter
logic does satisfy the need; however, there are other cases as well
where it doesn't actually help much.  One of those cases I am trying to
improve (easing backend buffer allocations), and another may be when
there is constant write activity, for which I am not sure how much it
really helps.  Part of the reason for trying to make bgwriter respond
mainly to ease backend allocations is the previous discussion of
the same; refer to the link below:
http://www.postgresql.org/message-id/ca+tgmoz7dvhc4h-ffjmzcff6vwynfoeapz021vxw61uh46r...@mail.gmail.com

However, if we want to retain the current property of bgwriter, we can
do so in one of the ways below:
a. Have separate processes for writing dirty buffers and moving buffers
to the freelist.
b. In the current bgwriter, separate the two tasks based on the need.
The need can be decided based on whether bgwriter has been woken up
due to a shortage of buffers on the freelist or due to
BgWriterDelay.

Now, as populating the freelist and balancing writes by writing dirty
buffers are two separate responsibilities, I am not sure if doing both
in one process is a good idea.

I am planning to take some more performance data, part of which will
be write load as well, but I am not sure if that can anyway show the
need as mentioned by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-05-16 Thread Amit Kapila
On Thu, May 15, 2014 at 11:11 AM, Amit Kapila amit.kapil...@gmail.comwrote:


 Data with LWLOCK_STATS
 --
 BufMappingLocks

 PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
 PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
 PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
 PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
 PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
 PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
 PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
 PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
 PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
 PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
 PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
 PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
 PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
 PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
 PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
 PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
 PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
 PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
 PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

 This data shows that after patch, there is no contention
 for BufFreeListLock, rather there is a huge contention around
 BufMappingLocks.  I have checked that HEAD also has contention
 around BufMappingLocks.

 As per my analysis till now, I think reducing contention around
 BufFreelistLock is not sufficient to improve scalability, we need
 to work on reducing contention around BufMappingLocks as well.


To reduce the contention around the BufMappingLocks, I have tried the
patch with just an increased number of buffer partitions, and it actually
shows a really significant increase in scalability, both due to reduced
contention around BufFreeListLock and around the BufMappingLocks.  The
real effect of reducing contention around BufFreeListLock was hidden
because the whole contention had shifted to the BufMappingLocks.  I have
taken performance data for both HEAD+increase_buf_part and
Patch+increase_buf_part to clearly see the benefit of reducing contention
around BufFreeListLock.  This data has been taken using a pgbench
read-only load (Select).

Performance Data
---
HEAD + 64 = HEAD + (NUM_BUFFER_PARTITIONS(64) +
   LOG2_NUM_LOCK_PARTITIONS(6))
V1 + 64  = PATCH + (NUM_BUFFER_PARTITIONS(64) +
   LOG2_NUM_LOCK_PARTITIONS(6))
Similarly 128 means 128 buffer partitions
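
For reference, these are the lwlock.h constants being varied; a sketch of
the experimental settings (the stock values at the time were 16 buffer
partitions and a LOG2_NUM_LOCK_PARTITIONS of 4):

    #define LOG2_NUM_LOCK_PARTITIONS  6     /* 7 for the "128" runs */
    #define NUM_BUFFER_PARTITIONS     64    /* 128 for the "128" runs */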

shared_buffers= 8GB
scale factor   = 3000
RAM - 64GB


            Thrds (64)  Thrds (128)
HEAD        45562       17128
HEAD + 64   57904       32810
V1 + 64     105557      81011
HEAD + 128  58383       32997
V1 + 128    110705      114544

shared_buffers= 8GB
scale factor   = 1000
RAM - 64GB


            Thrds (64)  Thrds (128)
HEAD        92142       31050
HEAD + 64   108120      86367
V1 + 64     117454      123429
HEAD + 128  107762      86902
V1 + 128    123641      124822


Observations
-
1. There is an increase of up to 5 times in performance for data that can
fit in memory but not in shared buffers.
2. Though there is an increase in performance just by increasing the
number of buffer partitions, it doesn't scale well (especially see the
case where partitions have increased to 128 from 64).


I have verified that contention has reduced around BufMappingLocks
by running the patch with LWLOCKS

BufFreeListLock
PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1


BufMappingLocks

PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0
PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0
PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0
PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0
..
..
PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0
PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0
PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0
PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0

We can clearly notice that the number for *blk* has reduced significantly
which shows that contention has reduced.


The patch is still in a shape to prove the merit of the idea, and I have
just changed the number of partitions so that if someone wants to verify
the performance for a similar load, it can be done by just applying
the patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


scalable_buffer_eviction_v2.patch
Description: Binary data


Re: [HACKERS] Scaling shared buffer eviction

2014-05-16 Thread Robert Haas
On Fri, May 16, 2014 at 10:51 AM, Amit Kapila amit.kapil...@gmail.comwrote:



             Thrds (64)  Thrds (128)
 HEAD        45562       17128
 HEAD + 64   57904       32810
 V1 + 64     105557      81011
 HEAD + 128  58383       32997
 V1 + 128    110705      114544


I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it.  The fact
that the patch produces little performance improvement on its own (though
it does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think it
would be better to replace the whole buffer mapping table with something
different.  I started working on that almost 2 years ago, building a
hash-table that can be read without requiring any locks and written with,
well, less locking than what we have right now:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash

I never got quite as far as trying to hook that up to the buffer mapping
machinery, but maybe that would be worth doing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Scaling shared buffer eviction

2014-05-16 Thread Peter Geoghegan
On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.comwrote:

 shared_buffers= 8GB
 scale factor   = 3000
 RAM - 64GB


             Thrds (64)  Thrds (128)
 HEAD             45562        17128
 HEAD + 64        57904        32810
 V1 + 64         105557        81011
 HEAD + 128       58383        32997
 V1 + 128        110705       114544

 shared_buffers= 8GB
 scale factor   = 1000
 RAM - 64GB


             Thrds (64)  Thrds (128)
 HEAD             92142        31050
 HEAD + 64       108120        86367
 V1 + 64         117454       123429
 HEAD + 128      107762        86902
 V1 + 128        123641       124822



I'm having a little trouble following this. These figures are transactions
per second for a 300-second pgbench tpc-b run? What does Thrds denote?

-- 
Peter Geoghegan


Re: [HACKERS] Scaling shared buffer eviction

2014-05-16 Thread Amit Kapila
On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan p...@heroku.com wrote:


 On Fri, May 16, 2014 at 7:51 AM, Amit Kapila amit.kapil...@gmail.com
wrote:

 shared_buffers= 8GB
 scale factor   = 3000
 RAM - 64GB

 I'm having a little trouble following this. These figures are transactions
per second for a 300-second pgbench tpc-b run?

Yes, the figures are tps for a 300-second run.
It is for select-only transactions.

What does Thrds denote?
It denotes the number of threads (-j in the pgbench run).

I have used the below commands to collect the data:
./pgbench -c 64 -j 64 -T 300 -S  postgres
./pgbench -c 128 -j 128 -T 300 -S  postgres

The reason for posting numbers for 64/128 threads is that the
concurrency bottleneck mainly appears when the number of connections
is higher than the number of CPU cores, and I am using a machine with
16 cores and 64 hardware threads.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Scaling shared buffer eviction

2014-05-16 Thread Amit Kapila
On Sat, May 17, 2014 at 6:02 AM, Robert Haas robertmh...@gmail.com wrote:

 I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it.  The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

 In terms of ameliorating contention on the buffer mapping locks, I think
it would be better to replace the whole buffer mapping table with something
different.

Is there anything bad, except perhaps an increase in LWLocks, about
scaling the number of hash partitions with shared buffers, either by
auto-tuning or by having a configuration knob?  I understand that it
would be a bit difficult for users to estimate the correct value of
such a parameter, but we could provide guidance about its usage in the
docs, e.g. if the user increases shared buffers to 'X' times (say 20
times) the default value (128MB), then consider increasing the number
of partitions, which should always be a power of 2; or we could do
something similar internally in the code.
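
A hypothetical sketch of the auto-tuning variant of this idea; none of
these names, nor the buffers-per-partition ratio, come from the patch
or from PostgreSQL itself:

/* Hypothetical: derive the buffer mapping partition count from the
 * number of shared buffers at startup, rounding up to a power of two
 * within bounds.  The 16384-buffers-per-partition ratio (128MB of 8KB
 * pages per partition) is an assumed example value. */
static int
ChooseBufferPartitions(int nbuffers)
{
    int     target = nbuffers / 16384;
    int     parts = 16;             /* the current fixed value as a floor */

    while (parts < target && parts < 1024)
        parts <<= 1;                /* stays a power of two */
    return parts;
}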


I agree that even with a reasonably good estimate of the number of
partitions relative to shared buffers, we might not be able to
eliminate the contention around the BufMappingLocks, but I think the
scalability we get by doing that is not bad either.

 I started working on that almost 2 years ago, building a hash-table that
can be read without requiring any locks and written with, well, less
locking than what we have right now:

I have still not read the complete code, but just going through the
initial file header, it seems to me that it will be much better than
the current implementation in terms of concurrency.  By the way, could
such an implementation be extended to improve the scalability of hash
indexes as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Scaling shared buffer eviction

2014-05-14 Thread Amit Kapila
As mentioned previously, I am interested in improving shared
buffer eviction, especially by reducing contention around
BufFreelistLock, and I would like to share my progress on the
same.

The test used for this work mainly covers the case where all the
data doesn't fit in shared buffers, but does fit in memory.
It is based on a previous comparison done by Robert
for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken an LWLOCK_STATS report to confirm
the contention around BufFreelistLock; the data for HEAD
is as follows:

Machine details
IBM POWER-7, 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5 mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows
that when the data doesn't fit in shared buffers, backends have
to wait to find a usable buffer.

To solve this issue, I have implemented a patch which makes
sure that there are always enough buffers on the freelist so
that backends rarely need to run the clock sweep.  The
implementation idea is more or less the same as discussed
previously in the below thread, so I will explain it at the end of
this mail.
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kap...@huawei.com
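
In outline, the approach looks roughly like the sketch below; the
names here are hypothetical stand-ins for the patch's actual data
structures, and the real code works against the buffer headers and
strategy control under proper locking:

/* Sketch: a background process keeps the freelist between a low and
 * a high water mark, so backends almost never run the clock sweep. */
static void
RefillFreelist(void)
{
    int     nfree = FreelistLength();           /* hypothetical accessor */

    if (nfree >= LOW_WATER_MARK)
        return;                                 /* list is healthy */

    while (nfree < HIGH_WATER_MARK)
    {
        BufferDesc *buf = ClockSweepNextBuffer();   /* hypothetical */

        /* only unpinned buffers with zero usage count are candidates */
        if (BufferIsEvictable(buf))                 /* hypothetical */
        {
            PushOnFreelist(buf);
            nfree++;
        }
    }
}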

LWLOCK_STATS data after the patch (same test as used for
HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* counts show that the need for
backends to run the clock sweep has reduced significantly.

Performance Data
---
shared_buffers= 8GB
number of threads - 64
sc - scale factor

         sc      tps
Head    3000   45569
Patch   3000   46457
Head    1000   93037
Patch   1000   92711

The above data shows that there is no significant change in
performance or scalability even after the contention around
BufFreelistLock is reduced significantly.

I have analyzed the patch with both perf record and
LWLOCK_STATS; both indicate high contention around the
BufMappingLocks.

Data With perf record -a -g
-

+  10.14%  swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77% postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88% postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%  pgbench  [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%  swapper  [kernel.kallsyms]  [k] .function_trace_call
+   2.99%  pgbench  [kernel.kallsyms]  [k] .function_trace_call
+   2.41% postgres  postgres           [.] AllocSetAlloc
+   2.38% postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%  pgbench  [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49% postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36% postgres  postgres           [.] AllocSetFreeIndex
+   1.09%  swapper  [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91% postgres  postgres           [.] GetSnapshotData
+   0.90% postgres  postgres           [.] MemoryContextAllocZeroAligned


Expanded graph
--

-  10.14%  swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
  - 10.13% .pseries_dedicated_idle_sleep
 - 10.13% .cpu_idle
- 10.00% .start_secondary
 .start_secondary_prolog
-   7.77% postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
   - ._raw_spin_lock
  - 6.63% ._raw_spin_lock
 - 5.95% .double_rq_lock
- .load_balance
   - 5.95% .__schedule
  - .schedule
 - 3.27% .SyS_semtimedop
  .SyS_ipc
  syscall_exit
  semop
  PGSemaphoreLock
  LWLockAcquireCommon
- LWLockAcquire
   - 3.27% BufferAlloc
ReadBuffer_common
 - ReadBufferExtended
 - 3.27% ReadBuffer
- 2.73% ReleaseAndReadBuffer
   - 1.70% _bt_relandgetbuf
_bt_search
_bt_first
btgettuple


This shows the LWLock wait inside BufferAlloc as the top contributor,
and BufferAlloc uses the BufMappingLocks.  I have checked the other
expanded calls as well; StrategyGetBuffer is not among the top
contributors.
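
For context, the contended path is the mapping-table lookup in
BufferAlloc, which takes one of the NUM_BUFFER_PARTITIONS mapping
locks chosen by the buffer tag's hash code, roughly as below
(paraphrasing bufmgr.c and buf_internals.h; exact types and macro
bodies vary by release):

/* Paraphrase of the BufferAlloc lookup path: the tag's hash picks a
 * partition lock, so two lookups contend only when their tags hash
 * into the same partition; more partitions spread that contention. */
static int
LookupBufferUnderPartitionLock(BufferTag *tag)
{
    uint32      hashcode = BufTableHashCode(tag);
    LWLock     *partitionLock = BufMappingPartitionLock(hashcode);
    int         buf_id;

    LWLockAcquire(partitionLock, LW_SHARED);
    buf_id = BufTableLookup(tag, hashcode);     /* -1 if not present */
    LWLockRelease(partitionLock);
    return buf_id;
}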

Data with LWLOCK_STATS
--
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 
