Re: [HACKERS] Move unused buffers to freelist

2014-02-07 Thread Jason Petersen
Bump.

I’m interested in many of the issues that were discussed in this thread. Was 
this patch ever wrapped up (I can’t find it in any CF), or did this thread die 
off?

—Jason

On Aug 6, 2013, at 12:18 AM, Amit Kapila amit.kap...@huawei.com wrote:

 On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
 On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
 Currently it wakes up based on the bgwriter_delay config parameter, which is
 by default 200ms, so you mean we should think of waking up the bgwriter based
 on allocations and the number of elements left in the freelist?
 
 I think that's what Andres and I are proposing, yes.
 
 As per my understanding, a summarization of the points raised by you and
 Andres which this patch should address to have a bigger win:
 
 1. Bgwriter needs to be improved so that it can help in reducing
 usage count
 and finding next victim buffer
   (run the clock sweep and add buffers to the free list).
 
 Check.
 I think one way to handle it is that while moving buffers to the freelist,
 if we find that there are not enough buffers (>= high watermark) which have
 zero usage count, then move through the buffer list and reduce usage counts.
 Now here I think it is important how we decide how many times we should
 circulate the buffer list to reduce usage counts.
 Currently I have kept it proportional to the number of times it failed to
 move enough buffers to the freelist.
 
 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
 are
 less.
 
 Check. The way to do this is to keep a variable in shared memory in
 the same cache line as the spinlock protecting the freelist, and
 update it when you update the free list.
 
 
  Added a new variable freelistLatch in BufferStrategyControl
 
  3. Split the global lock (BufFreelistLock) under which the work is done in
     StrategyGetBuffer
    (a spinlock for the freelist, and an lwlock for the clock sweep).
 
 Check.
 
 Added a new variable freelist_lck in BufferStrategyControl which will be
 used to protect the freelist.
 BufFreelistLock will still be used to protect the clock sweep part of
 StrategyGetBuffer.
 
 
 
 4. Separate processes for writing dirty buffers and moving buffers to
 freelist
 
 I think this part might be best pushed to a separate patch, although I
 agree we probably need it.
 
 5. Bgwriter needs to be more aggressive, logic based on which it
 calculates
 how many buffers it needs to process needs to be improved.
 
 This is basically overlapping with points already made.  I suspect we
 could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
 bgwriter_lru_multiplier altogether.  The background writer would just
 have a high and a low watermark.  When the number of buffers on the
 freelist drops below the low watermark, the allocating backend sets
 the latch and bgwriter wakes up and begins adding buffers to the
 freelist.  When the number of buffers on the free list reaches the
 high watermark, the background writer goes back to sleep.  Some
 experimentation might be needed to figure out what values are
 appropriate for those watermarks.  In theory this could be a
 configuration knob, but I suspect it's better to just make the system
 tune it right automatically.
 
 Currently in the patch I have used a low watermark of 1/6 and a high watermark
 of 1/3 of NBuffers.
 The values are hardcoded for now, but I will change them to GUCs or #defines.
 As far as I can see there is no way to find the number of buffers on the
 freelist, so I have added one more variable to maintain it.
 Initially I thought that I could use the existing variables firstFreeBuffer
 and lastFreeBuffer to calculate it, but it may not be accurate, as once
 buffers are moved to the freelist these don't give an exact count.
 
 The main doubt here is: what if, after traversing all buffers, it doesn't find
 enough buffers to meet the high watermark?

 Currently I just exit the loop that moves buffers and instead try to reduce
 usage counts, as explained in point 1.
 
 6. There can be contention around buffer mapping locks, but we can
 focus on
 it later
 7. cacheline bouncing around the buffer header spinlocks, is there
 anything
 we can do to reduce this?
 
 I think these are points that we should leave for the future.
 
 This is just a WIP patch. I have kept the older code in comments. I need to
 refine it further and collect performance data.
 I have prepared a script (perf_buff_mgmt.sh) to collect performance data for
 different shared_buffers/scale factor/number of clients combinations.
 
 Top-level points which still need to be taken care of:
 1. Choose optimistically used buffers in StrategyGetBuffer(). Refer to Simon's
 patch:
   https://commitfest.postgresql.org/action/patch_view?id=743
 2. Don't bump the usage count every time a buffer is pinned. I got this idea
   when reading the archives about improvements in this area.
 
 With Regards,
 Amit Kapila.
 changed_freelist_mgmt.patch, perf_buff_mgmt.sh




Re: [HACKERS] Move unused buffers to freelist

2014-02-07 Thread Amit Kapila
On Sat, Feb 8, 2014 at 7:16 AM, Jason Petersen ja...@citusdata.com wrote:
 Bump.

 I'm interested in many of the issues that were discussed in this thread. Was 
 this patch ever wrapped up (I can't find it in any CF), or did this thread 
 die off?

This patch and variants of it have been discussed multiple times; some of
the CF entries are below:

Recent
https://commitfest.postgresql.org/action/patch_view?id=1113

Previous
https://commitfest.postgresql.org/action/patch_view?id=932

The main thing needed for this idea is to arrive at tests/scenarios where we
can show the benefit of this patch. I didn't get time during 9.4 to work on
this again, but I might work on it in the next version. If you could help with
some scenarios/tests where this patch can show a benefit, that would be really
good.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Move unused buffers to freelist

2013-08-06 Thread Amit Kapila
On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
 On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Currently it wakes up based on the bgwriter_delay config parameter, which is
  by default 200ms, so you mean we should think of waking up the bgwriter
  based on allocations and the number of elements left in the freelist?
 
 I think that's what Andres and I are proposing, yes.
 
  As per my understanding, a summarization of the points raised by you and
  Andres which this patch should address to have a bigger win:
 
  1. Bgwriter needs to be improved so that it can help in reducing
 usage count
  and finding next victim buffer
 (run the clock sweep and add buffers to the free list).
 
 Check.
 I think one way to handle it is that while moving buffers to the freelist,
 if we find that there are not enough buffers (>= high watermark) which have
 zero usage count, then move through the buffer list and reduce usage counts.
 Now here I think it is important how we decide how many times we should
 circulate the buffer list to reduce usage counts.
 Currently I have kept it proportional to the number of times it failed to
 move enough buffers to the freelist.
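
As a rough illustration of that heuristic (extra sweep passes proportional to
how often earlier attempts fell short), here is a minimal sketch in
PostgreSQL-style C; the function name, the StrategyPutBufferOnFreelist helper
and the pass-count bookkeeping are assumptions for this example, not the
identifiers used in the actual patch:

/*
 * Sketch only: all names here are illustrative, not from the actual patch.
 * If a sweep fails to push enough zero-usage-count buffers onto the
 * freelist, the next sweep circulates the buffer pool additional times,
 * decrementing usage counts as it goes.
 */
static int	failed_tries = 0;		/* consecutive sweeps that fell short */

static void
MoveBuffersToFreelist(int num_needed)
{
	int		passes = 1 + failed_tries;	/* proportional to past failures */
	int		moved = 0;
	int		i;

	while (passes-- > 0 && moved < num_needed)
	{
		for (i = 0; i < NBuffers && moved < num_needed; i++)
		{
			volatile BufferDesc *buf = &BufferDescriptors[i];

			LockBufHdr(buf);
			if (buf->refcount == 0)
			{
				if (buf->usage_count == 0)
				{
					StrategyPutBufferOnFreelist(buf);	/* assumed helper */
					moved++;
				}
				else
					buf->usage_count--;	/* candidate on a later pass */
			}
			UnlockBufHdr(buf);
		}
	}

	failed_tries = (moved < num_needed) ? failed_tries + 1 : 0;
}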
 
  2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist
 are
  less.
 
 Check. The way to do this is to keep a variable in shared memory in
 the same cache line as the spinlock protecting the freelist, and
 update it when you update the free list.


  Added a new variable freelistLatch in BufferStrategyControl
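
For orientation, a rough sketch of how BufferStrategyControl might end up
looking with these additions; apart from freelistLatch and freelist_lck, which
are named in this mail, the numFreeBuffers field and the exact grouping are
assumptions, not the patch's actual definition:

/*
 * Illustrative layout only -- the idea is to keep the freelist spinlock,
 * the freelist length and the latch together, per the suggestion above.
 */
typedef struct
{
	/* Clock sweep hand, still protected by BufFreelistLock */
	int			nextVictimBuffer;

	/* Freelist, now protected by its own spinlock */
	slock_t		freelist_lck;		/* protects the three fields below */
	int			firstFreeBuffer;	/* head of list of free buffers */
	int			lastFreeBuffer;		/* tail of list of free buffers */
	int			numFreeBuffers;		/* current freelist length (assumed name) */

	/* Latch an allocating backend sets when the freelist runs low */
	Latch	   *freelistLatch;

	/* Statistics, as today */
	uint32		completePasses;
	uint32		numBufferAllocs;
} BufferStrategyControl;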

  3. Split the global lock (BufFreelistLock) under which the work is done in
  StrategyGetBuffer
 (a spinlock for the freelist, and an lwlock for the clock sweep).
 
 Check.

 Added a new variable freelist_lck in BufferStrategyControl which will be
 used to protect the freelist.
 BufFreelistLock will still be used to protect the clock sweep part of
 StrategyGetBuffer.

 

  4. Separate processes for writing dirty buffers and moving buffers to
  freelist
 
 I think this part might be best pushed to a separate patch, although I
 agree we probably need it.
 
  5. Bgwriter needs to be more aggressive, logic based on which it
 calculates
  how many buffers it needs to process needs to be improved.
 
 This is basically overlapping with points already made.  I suspect we
 could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
 bgwriter_lru_multiplier altogether.  The background writer would just
 have a high and a low watermark.  When the number of buffers on the
 freelist drops below the low watermark, the allocating backend sets
 the latch and bgwriter wakes up and begins adding buffers to the
 freelist.  When the number of buffers on the free list reaches the
 high watermark, the background writer goes back to sleep.  Some
 experimentation might be needed to figure out what values are
 appropriate for those watermarks.  In theory this could be a
 configuration knob, but I suspect it's better to just make the system
 tune it right automatically.

Currently in the patch I have used a low watermark of 1/6 and a high watermark
of 1/3 of NBuffers.
The values are hardcoded for now, but I will change them to GUCs or #defines.
As far as I can see there is no way to find the number of buffers on the
freelist, so I have added one more variable to maintain it.
Initially I thought that I could use the existing variables firstFreeBuffer
and lastFreeBuffer to calculate it, but it may not be accurate, as once
buffers are moved to the freelist these don't give an exact count.

The main doubt here is: what if, after traversing all buffers, it doesn't find
enough buffers to meet the high watermark?

Currently I just exit the loop that moves buffers and instead try to reduce
usage counts, as explained in point 1.
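
To make the low-watermark interaction concrete, here is a minimal sketch
(assuming the freelist spinlock, counter and latch described above; the
watermark macros, the numFreeBuffers field and the helper name are
illustrative, not the patch's code) of how an allocating backend might pop a
buffer and wake the bgwriter:

/* Illustrative watermarks matching the fractions mentioned above */
#define FREELIST_LOW_WATERMARK(nbuf)	((nbuf) / 6)
#define FREELIST_HIGH_WATERMARK(nbuf)	((nbuf) / 3)

/*
 * Sketch of the freelist fast path in StrategyGetBuffer(): pop one buffer
 * under the freelist spinlock and wake the bgwriter if the list runs low.
 */
static volatile BufferDesc *
GetBufferFromFreelist(void)
{
	volatile BufferDesc *buf = NULL;
	bool		wake_bgwriter = false;

	SpinLockAcquire(&StrategyControl->freelist_lck);
	if (StrategyControl->firstFreeBuffer >= 0)
	{
		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
		StrategyControl->firstFreeBuffer = buf->freeNext;
		StrategyControl->numFreeBuffers--;

		if (StrategyControl->numFreeBuffers < FREELIST_LOW_WATERMARK(NBuffers))
			wake_bgwriter = true;
	}
	SpinLockRelease(&StrategyControl->freelist_lck);

	/* Keep the spinlock section tiny; signal the bgwriter outside it */
	if (wake_bgwriter && StrategyControl->freelistLatch != NULL)
		SetLatch(StrategyControl->freelistLatch);

	return buf;					/* NULL means fall back to the clock sweep */
}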

  6. There can be contention around buffer mapping locks, but we can
 focus on
  it later
  7. cacheline bouncing around the buffer header spinlocks, is there
 anything
  we can do to reduce this?
 
 I think these are points that we should leave for the future.

This is just a WIP patch. I have kept the older code in comments. I need to
refine it further and collect performance data.
I have prepared a script (perf_buff_mgmt.sh) to collect performance data for
different shared_buffers/scale factor/number of clients combinations.

Top-level points which still need to be taken care of:
1. Choose optimistically used buffers in StrategyGetBuffer(). Refer to Simon's
patch:
   https://commitfest.postgresql.org/action/patch_view?id=743
2. Don't bump the usage count every time a buffer is pinned. I got this idea
   when reading the archives about improvements in this area.

With Regards,
Amit Kapila.


changed_freelist_mgmt.patch
Description: Binary data


perf_buff_mgmt.sh
Description: Binary data



Re: [HACKERS] Move unused buffers to freelist

2013-07-03 Thread Simon Riggs
On 28 June 2013 05:52, Amit Kapila amit.kap...@huawei.com wrote:


 As per my understanding Summarization of points raised by you and Andres
 which this patch should address to have a bigger win:

 1. Bgwriter needs to be improved so that it can help in reducing usage
 count
 and finding next victim buffer
(run the clock sweep and add buffers to the free list).
 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are
 less.
 3. Split the global lock (BufFreelistLock) under which the work is done in
 StrategyGetBuffer
(a spinlock for the freelist, and an lwlock for the clock sweep).
 4. Separate processes for writing dirty buffers and moving buffers to
 freelist
 5. Bgwriter needs to be more aggressive, logic based on which it calculates
 how many buffers it needs to process needs to be improved.
 6. There can be contention around buffer mapping locks, but we can focus on
 it later
 7. cacheline bouncing around the buffer header spinlocks, is there anything
 we can do to reduce this?


My perspectives here would be

* BufFreelistLock is a huge issue. Finding a next victim block needs to be
an O(1) operation, yet it is currently much worse than that. Measuring
contention on that lock hides that problem, since having shared buffers
lock up for 100ms or more but only occasionally is a huge problem, even if
it doesn't occur frequently enough for the averaged contention to show as
an issue.

* I'm more interested in reducing response time spikes than in increasing
throughput. It's easy to overload a benchmark so we get better throughput
numbers, but that's not helpful if we make the system more bursty.

* bgwriter's effectiveness is not guaranteed. We have many clear cases
where it is useless. So we should continually answer the question: do we need
a bgwriter and, if so, what should it do? The fact we have one already doesn't
mean it should be given things to do. It is a possible option that things may
be better if it did nothing. (Not saying that is true, just that we must
consider that option each time.)

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Move unused buffers to freelist

2013-07-03 Thread Amit Kapila
On Wednesday, July 03, 2013 12:27 PM Simon Riggs wrote:
On 28 June 2013 05:52, Amit Kapila amit.kap...@huawei.com wrote:
 
 As per my understanding Summarization of points raised by you and Andres
 which this patch should address to have a bigger win:

 1. Bgwriter needs to be improved so that it can help in reducing usage
count
 and finding next victim buffer
   (run the clock sweep and add buffers to the free list).
2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are
less.
3. Split the global lock (BufFreelistLock) under which the work is done in
StrategyGetBuffer
   (a spinlock for the freelist, and an lwlock for the clock sweep).
4. Separate processes for writing dirty buffers and moving buffers to
freelist
5. Bgwriter needs to be more aggressive, logic based on which it
calculates
how many buffers it needs to process needs to be improved.
6. There can be contention around buffer mapping locks, but we can focus
on
it later
7. cacheline bouncing around the buffer header spinlocks, is there
anything
we can do to reduce this?

My perspectives here would be

 * BufFreelistLock is a huge issue. Finding a next victim block needs to be
an O(1) operation, yet it is currently much worse than that. Measuring 
 contention on that lock hides that problem, since having shared buffers
lock up for 100ms or more but only occasionally is a huge problem, even if
it 
 doesn't occur frequently enough for the averaged contention to show as an
issue.

  To optimize finding the next victim buffer, I am planning to run the clock
sweep in the background. Apart from that, do you have any ideas to make it
closer to O(1)?

With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-07-03 Thread Simon Riggs
On 3 July 2013 12:56, Amit Kapila amit.kap...@huawei.com wrote:


 My perspectives here would be

  * BufFreelistLock is a huge issue. Finding a next victim block needs to
 be
 an O(1) operation, yet it is currently much worse than that. Measuring
  contention on that lock hides that problem, since having shared buffers
 lock up for 100ms or more but only occasionally is a huge problem, even if
 it
  doesn't occur frequently enough for the averaged contention to show as an
 issue.

   To optimize finding next victim buffer, I am planning to run the clock
 sweep in background. Apart from that do you have any idea to make it closer
 to O(1)?


Yes, I already posted patches to attenuate the search time. Please check
back over the last few CFs of 9.3.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Move unused buffers to freelist

2013-07-03 Thread Amit Kapila
On Wednesday, July 03, 2013 6:10 PM Simon Riggs wrote:
On 3 July 2013 12:56, Amit Kapila amit.kap...@huawei.com wrote:
 
My perspectives here would be

 * BufFreelistLock is a huge issue. Finding a next victim block needs to
be
an O(1) operation, yet it is currently much worse than that. Measuring
 contention on that lock hides that problem, since having shared buffers
lock up for 100ms or more but only occasionally is a huge problem, even if
it
 doesn't occur frequently enough for the averaged contention to show as
an
issue.
  To optimize finding the next victim buffer, I am planning to run the clock
 sweep in the background. Apart from that, do you have any ideas to make it
 closer to O(1)?

 Yes, I already posted patches to attenuate the search time. Please check
back over the last few CFs of 9.3.

Okay, I got it. I think you mean 9.2. 

Patch: Reduce locking on StrategySyncStart() 
https://commitfest.postgresql.org/action/patch_view?id=743 


Patch: Reduce freelist locking during DROP TABLE/DROP DATABASE 
https://commitfest.postgresql.org/action/patch_view?id=744

I shall pay attention to those patches and the discussion during my work on
enhancing this patch.


With Regards,
Amit Kapila.








Re: [HACKERS] Move unused buffers to freelist

2013-07-02 Thread Amit Kapila
On Tuesday, July 02, 2013 12:00 AM Robert Haas wrote:
 On Sun, Jun 30, 2013 at 3:24 AM, Amit kapila amit.kap...@huawei.com
 wrote:
   Do you think it will be sufficient to just wake the bgwriter when the
  number of buffers in the freelist drops below the low watermark? How about
  its current job of flushing dirty buffers?
 
 Well, the only point of flushing dirty buffers in the background
 writer is to make sure that backends can allocate buffers quickly.  If
 there are clean buffers already in the freelist, that's not a concern.
  So...
 
   I meant to ask: if, for some scenario, there are sufficient buffers in the
  freelist but most other buffers are dirty, is it okay to delay the flush
  until the number of buffers falls below the low watermark?
 
 ...I think this is OK, or at least we should assume it's OK until we
 have evidence that it isn't.

Sure. After completing my other CommitFest review work, I will devise a
solution for the suggestions summarized in the previous mail and then start a
discussion about it.


With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-07-01 Thread Robert Haas
On Sun, Jun 30, 2013 at 3:24 AM, Amit kapila amit.kap...@huawei.com wrote:
 Do you think it will be sufficient to just wake the bgwriter when the number
 of buffers in the freelist drops below the low watermark? How about its
 current job of flushing dirty buffers?

Well, the only point of flushing dirty buffers in the background
writer is to make sure that backends can allocate buffers quickly.  If
there are clean buffers already in the freelist, that's not a concern.
 So...

 I meant to ask: if, for some scenario, there are sufficient buffers in the
 freelist but most other buffers are dirty, is it okay to delay the flush
 until the number of buffers falls below the low watermark?

...I think this is OK, or at least we should assume it's OK until we
have evidence that it isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-30 Thread Amit kapila
On Friday, June 28, 2013 6:20 PM Robert Haas wrote:
On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Currently it wakes up based on the bgwriter_delay config parameter, which is
 by default 200ms, so you mean we should think of waking up the bgwriter based
 on allocations and the number of elements left in the freelist?

 I think that's what Andres and I are proposing, yes.

 As per my understanding, a summarization of the points raised by you and
 Andres which this patch should address to have a bigger win:

 1. Bgwriter needs to be improved so that it can help in reducing usage count
 and finding next victim buffer
(run the clock sweep and add buffers to the free list).

 Check.

 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are
 less.

Check.  The way to do this is to keep a variable in shared memory in
the same cache line as the spinlock protecting the freelist, and
update it when you update the free list.

 3. Split the global lock (BufFreelistLock) under which the work is done in
 StrategyGetBuffer
(a spinlock for the freelist, and an lwlock for the clock sweep).

Check.

 4. Separate processes for writing dirty buffers and moving buffers to
 freelist

 I think this part might be best pushed to a separate patch, although I
 agree we probably need it.

 5. Bgwriter needs to be more aggressive, logic based on which it calculates
 how many buffers it needs to process needs to be improved.

 This is basically overlapping with points already made.  I suspect we
 could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
 bgwriter_lru_multiplier altogether.  The background writer would just
 have a high and a low watermark.  When the number of buffers on the
 freelist drops below the low watermark, the allocating backend sets
 the latch and bgwriter wakes up and begins adding buffers to the
 freelist.  When the number of buffers on the free list reaches the
 high watermark, the background writer goes back to sleep.  Some
 experimentation might be needed to figure out what values are
 appropriate for those watermarks.  In theory this could be a
 configuration knob, but I suspect it's better to just make the system
 tune it right automatically.

Do you think it will be sufficient to just wake the bgwriter when the number
of buffers in the freelist drops below the low watermark? How about its
current job of flushing dirty buffers?

I meant to ask: if, for some scenario, there are sufficient buffers in the
freelist but most other buffers are dirty, is it okay to delay the flush
until the number of buffers falls below the low watermark?

 6. There can be contention around buffer mapping locks, but we can focus on
 it later
 7. cacheline bouncing around the buffer header spinlocks, is there anything
 we can do to reduce this?

 I think these are points that we should leave for the future.

with Regards,
Amit Kapila.



Re: [HACKERS] Move unused buffers to freelist

2013-06-30 Thread Amit kapila

On Friday, June 28, 2013 6:38 PM Robert Haas wrote:
On Fri, Jun 28, 2013 at 8:50 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Currently it wakes up based on the bgwriter_delay config parameter, which is
 by default 200ms, so you mean we should think of waking up the bgwriter based
 on allocations and the number of elements left in the freelist?

 I think that's what Andres and I are proposing, yes.

 Incidentally, I'm going to mark this patch Returned with Feedback in
the CF application.  

Many thanks to you and Andres for providing valuable suggestions.

I think this line of inquiry has potential, but
clearly there's a lot more work to do here before we commit anything,
and I don't think that's going to happen in the next few weeks.  But
let's keep discussing.

Sure.

With Regards,
Amit Kapila.



Re: [HACKERS] Move unused buffers to freelist

2013-06-28 Thread Robert Haas
On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Currently it wakes up based on the bgwriter_delay config parameter, which is
 by default 200ms, so you mean we should think of waking up the bgwriter based
 on allocations and the number of elements left in the freelist?

I think that's what Andres and I are proposing, yes.

 As per my understanding, a summarization of the points raised by you and
 Andres which this patch should address to have a bigger win:

 1. Bgwriter needs to be improved so that it can help in reducing usage count
 and finding next victim buffer
(run the clock sweep and add buffers to the free list).

Check.

 2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are
 less.

Check.  The way to do this is to keep a variable in shared memory in
the same cache line as the spinlock protecting the freelist, and
update it when you update the free list.

 3. Split the global lock (BufFreelistLock) under which the work is done in
 StrategyGetBuffer
 (a spinlock for the freelist, and an lwlock for the clock sweep).

Check.

 4. Separate processes for writing dirty buffers and moving buffers to
 freelist

I think this part might be best pushed to a separate patch, although I
agree we probably need it.

 5. Bgwriter needs to be more aggressive, logic based on which it calculates
 how many buffers it needs to process needs to be improved.

This is basically overlapping with points already made.  I suspect we
could just get rid of bgwriter_delay, bgwriter_lru_maxpages, and
bgwriter_lru_multiplier altogether.  The background writer would just
have a high and a low watermark.  When the number of buffers on the
freelist drops below the low watermark, the allocating backend sets
the latch and bgwriter wakes up and begins adding buffers to the
freelist.  When the number of buffers on the free list reaches the
high watermark, the background writer goes back to sleep.  Some
experimentation might be needed to figure out what values are
appropriate for those watermarks.  In theory this could be a
configuration knob, but I suspect it's better to just make the system
tune it right automatically.
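
For illustration, the bgwriter side of this high/low watermark scheme might
look roughly like the sketch below. RefillFreelistUpTo(), the watermark macro
and the numFreeBuffers field are placeholders assumed for this example, not
existing PostgreSQL code; MyLatch stands in for the bgwriter's own process
latch, and the loop follows the standard ResetLatch/check/WaitLatch pattern.

/*
 * Sketch of a watermark-driven bgwriter main loop.
 */
for (;;)
{
	int		nfree;

	/* Clear the latch before checking, so a wakeup that arrives between
	 * the check and WaitLatch() is not lost. */
	ResetLatch(MyLatch);

	SpinLockAcquire(&StrategyControl->freelist_lck);
	nfree = StrategyControl->numFreeBuffers;
	SpinLockRelease(&StrategyControl->freelist_lck);

	if (nfree < FREELIST_HIGH_WATERMARK(NBuffers))
	{
		/*
		 * Run the clock sweep, writing out dirty victims and pushing
		 * clean zero-usage buffers onto the freelist, until the list
		 * reaches the high watermark (e.g. NBuffers / 3).
		 */
		RefillFreelistUpTo(FREELIST_HIGH_WATERMARK(NBuffers));
	}

	/* Sleep until an allocating backend reports the list is low. */
	WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L);
}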

 6. There can be contention around buffer mapping locks, but we can focus on
 it later
 7. cacheline bouncing around the buffer header spinlocks, is there anything
 we can do to reduce this?

I think these are points that we should leave for the future.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-28 Thread Robert Haas
On Fri, Jun 28, 2013 at 8:50 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Currently it wakes up based on the bgwriter_delay config parameter, which is
 by default 200ms, so you mean we should think of waking up the bgwriter based
 on allocations and the number of elements left in the freelist?

 I think that's what Andres and I are proposing, yes.

Incidentally, I'm going to mark this patch Returned with Feedback in
the CF application.  I think this line of inquiry has potential, but
clearly there's a lot more work to do here before we commit anything,
and I don't think that's going to happen in the next few weeks.  But
let's keep discussing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-28 Thread Greg Smith

On 6/28/13 8:50 AM, Robert Haas wrote:

On Fri, Jun 28, 2013 at 12:52 AM, Amit Kapila amit.kap...@huawei.com wrote:

4. Separate processes for writing dirty buffers and moving buffers to
freelist


I think this part might be best pushed to a separate patch, although I
agree we probably need it.


This might be necessary eventually, but it's going to make things more 
complicated.  And I don't think it's a blocker for creating something 
useful.  The two most common workloads are:


1) Lots of low usage count data, typically data that is updated sparsely 
across a larger database.  These are helped by a process that writes 
dirty buffers in the background.  These benefit from the current 
background writer.  Kevin's system he was just mentioning again is the 
best example of this type that there's public data on.


2) Lots of high usage count data, because there are large hotspots in 
things like index blocks.  Most writes happen at checkpoint time, 
because the background writer won't touch them.  Because there are only 
a small number of re-usable pages, the clock sweep goes around very fast 
looking for them.  This is the type of workload that should benefit from 
putting buffers into the free list.  pgbench provides a simple example 
of this type, which is why Amit's tests using it have been useful.


If you had a process that tried to handle both background writes and 
freelist management, I suspect one path would be hot and the other 
almost idle in each type of system.  I don't expect that splitting those 
into two separate processes would buy a lot of value, so that can easily be 
pushed to a later patch.



The background writer would just
have a high and a low watermark.  When the number of buffers on the
freelist drops below the low watermark, the allocating backend sets
the latch and bgwriter wakes up and begins adding buffers to the
freelist.  When the number of buffers on the free list reaches the
high watermark, the background writer goes back to sleep.


This will work fine for all of the common workloads.  The main challenge 
is keeping the buffer allocation counting from turning into a hotspot. 
Busy systems now can easily hit 100K buffer allocations/second.  I'm not 
too worried about it because those allocations are making the free list 
lock a hotspot right now.


One of the consistently controversial parts of the current background 
writer is how it tries to loop over the buffer cache every 2 minutes, 
regardless of activity level.  The idea there was that on bursty 
workloads, buffers would be cleaned during idle periods with that 
mechanism.  Part of why that's in there is to deal with the relatively 
long pause between background writer runs.


This refactoring idea will make that hard to keep around.  I think this 
is OK though.  Switching to a latch based design should eliminate the 
bgwriter_delay, which means you won't have this worst case of a 200ms 
stall while heavy activity is incoming.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Move unused buffers to freelist

2013-06-28 Thread Robert Haas
On Fri, Jun 28, 2013 at 12:10 PM, Greg Smith g...@2ndquadrant.com wrote:
 This refactoring idea will make that hard to keep around.  I think this is
 OK though.  Switching to a latch based design should eliminate the
 bgwriter_delay, which means you won't have this worst case of a 200ms stall
 while heavy activity is incoming.

I'm a strong proponent of that 2 minute cycle, so I'd vote for finding
a way to keep it around.  But I don't think that (or 200 ms wakeups)
should be the primary thing driving the background writer, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Configuration Details
 O/S - Suse-11
 RAM - 128GB
 Number of Cores - 16
 Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
 synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
 Pgbench - Select-only, Scale factor - 1200, Time - 30 mins

          8C-8T    16C-16T    32C-32T    64C-64T
 Head     62403     101810      99516      94707
 Patch    62827     101404      99109      94744

 On 128GB RAM, if we use scale factor 1200 (database approx. 17GB) and 14GB
 shared buffers, there is no major difference.
 One of the reasons could be that there is not much swapping in shared
 buffers, as most data already fits in shared buffers.

I'd like to just back up a minute here and talk about the broader
picture here.  What are we trying to accomplish with this patch?  Last
year, I did some benchmarking on a big IBM POWER7 machine (16 cores,
64 hardware threads).  Here are the results:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

Now, if you look at these results, you see something interesting.
When there aren't too many concurrent connections, the higher scale
factors are only modestly slower than the lower scale factors.  But as
the number of connections increases, the performance continues to rise
at the lower scale factors, and at the higher scale factors, this
performance stops rising and in fact drops off.  So in other words,
there's no huge *performance* problem for a working set larger than
shared_buffers, but there is a huge *scalability* problem.  Now why is
that?

As far as I can tell, the answer is that we've got a scalability
problem around BufFreelistLock.  Contention on the buffer mapping
locks may also be a problem, but all of my previous benchmarking (with
LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
in the room.  My interest in having the background writer add buffers
to the free list is basically around solving that problem.  It's a
pretty dramatic problem, as the graph above shows, and this patch
doesn't solve it.  There may be corner cases where this patch improves
things (or, equally, makes them worse) but as a general point, the
difficulty I've had reproducing your test results and the specificity
of your instructions for reproducing them suggests to me that what we
have here is not a clear improvement on general workloads.  Yet such
an improvement should exist, because there are other products in the
world that have scalable buffer managers; we currently don't.  Instead
of spending a lot of time trying to figure out whether there's a small
win in narrow cases here (and there may well be), I think we should
back up and ask why this isn't a great big win, and what we'd need to
do to *get* a great big win.  I don't see much point in tinkering
around the edges here if things are broken in the middle; things that
seem like small wins or losses now may turn out otherwise in the face
of a more comprehensive solution.

One thing that occurred to me while writing this note is that the
background writer doesn't have any compelling reason to run on a
read-only workload.  It will still run at a certain minimum rate, so
that it cycles the buffer pool every 2 minutes, if I remember
correctly.  But it won't run anywhere near fast enough to keep up with
the buffer allocation demands of 8, or 32, or 64 sessions all reading
data not all of which is in shared_buffers at top speed.  In fact,
we've had reports that the background writer isn't too effective even
on read-write workloads.  The point is - if the background writer
isn't waking up and running frequently enough, what it does when it
does wake up isn't going to matter very much.  I think we need to
spend some energy poking at that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Andres Freund
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
 I'd like to just back up a minute here and talk about the broader
 picture here.

Sounds like a very good plan.

 So in other words,
 there's no huge *performance* problem for a working set larger than
 shared_buffers, but there is a huge *scalability* problem.  Now why is
 that?

 As far as I can tell, the answer is that we've got a scalability
 problem around BufFreelistLock.

Part of the problem is its name ;)

 Contention on the buffer mapping
 locks may also be a problem, but all of my previous benchmarking (with
 LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
 in the room.

Contention-wise I agree. What I have seen is that we have a huge
amount of cacheline bouncing around the buffer header spinlocks.

 My interest in having the background writer add buffers
 to the free list is basically around solving that problem.  It's a
 pretty dramatic problem, as the graph above shows, and this patch
 doesn't solve it.

 One thing that occurred to me while writing this note is that the
 background writer doesn't have any compelling reason to run on a
 read-only workload.  It will still run at a certain minimum rate, so
 that it cycles the buffer pool every 2 minutes, if I remember
 correctly.

I have previously added some ad-hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I actually found any workload where
the bgwriter actually wrote out a relevant percentage of the necessary
pages.
Which would explain why the patch doesn't have a big benefit. The
freelist is empty most of the time, so we don't benefit from the reduced
work done under the lock.

I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.

Problems with the current code:

* doesn't manipulate the usage_count and never does anything to used
  pages. Which means it will just about never find a victim buffer in a
  busy database.
* by far not aggressive enough, touches only a few buffers ahead of the
  clock sweep.
* does not advance the clock sweep, so the individual backends will
  touch the same buffers again and transfer all the buffer spinlock
  cacheline around
* The adaptation logic it has makes it so slow to adapt that it takes
  several minutes to adapt.
* ...


There's another thing we could do to noticeably improve the scalability of
buffer acquisition. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:
LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
...
// check freelist, will usually be empty
...
for (;;)
{
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    ++StrategyControl->nextVictimBuffer;

    LockBufHdr(buf);
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
        }
        else
        {
            /* Found a usable buffer */
            if (strategy != NULL)
                AddBufferToRing(strategy, buf);
            return buf;
        }
    }
    UnlockBufHdr(buf);
}

So, we perform the entire clock sweep, until we find a single buffer we
can use, inside a *global* lock. At times we need to iterate over the
whole of shared_buffers BM_MAX_USAGE_COUNT (5) times till we have pushed down
all the usage counts enough (if the database is busy it can take even
longer...).
In a busy database where usually all the usage counts are high, the next
backend will touch a lot of those buffers again, which causes massive
cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.
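
As a rough sketch of that idea (assuming a new spinlock, here called
victim_lck, that covers only the clock hand; this is an illustration, not the
actual proposal's code), the loop above might become:

for (;;)
{
    volatile BufferDesc *buf;
    int         victim;

    /* Tiny critical section: only advance the clock hand */
    SpinLockAcquire(&StrategyControl->victim_lck);      /* assumed field */
    victim = StrategyControl->nextVictimBuffer;
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }
    SpinLockRelease(&StrategyControl->victim_lck);

    /* Inspect the buffer header without holding any global lock */
    buf = &BufferDescriptors[victim];
    LockBufHdr(buf);
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
            buf->usage_count--;
        else
        {
            /* Found a usable buffer */
            if (strategy != NULL)
                AddBufferToRing(strategy, buf);
            return buf;
        }
    }
    UnlockBufHdr(buf);
}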

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense, because then backends don't need to deal with dirty pages.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 Contention-wise I agree. What I have seen is that we have a huge
 amount of cacheline bouncing around the buffer header spinlocks.

How did you measure that?

 I have previously added some adhoc instrumentation that printed the
 amount of buffers that were required (by other backends) during a
 bgwriter cycle and the amount of buffers that the buffer manager could
 actually write out.

I think you can see how many are needed from buffers_alloc.  No?

 I don't think I actually found any workload where
 the bgwriter actually wrote out a relevant percentage of the necessary
 pages.

Check.

 Problems with the current code:

 * doesn't manipulate the usage_count and never does anything to used
   pages. Which means it will just about never find a victim buffer in a
   busy database.

Right.  I was thinking that was part of this patch, but it isn't.  I
think we should definitely add that.  In other words, the background
writer's job should be to run the clock sweep and add buffers to the
free list.  I think we should also split the lock: a spinlock for the
freelist, and an lwlock for the clock sweep.

 * by far not aggressive enough, touches only a few buffers ahead of the
   clock sweep.

Check.  Fixing this might be a separate patch, but then again maybe
not.  The changes we're talking about here provide a natural feedback
mechanism: if we observe that the freelist is empty (or less than some
length, like 32 buffers?) set the background writer's latch, because
we know it's not keeping up.

 * does not advance the clock sweep, so the individual backends will
   touch the same buffers again and transfer all the buffer spinlock
   cacheline around

Yes, I think that should be fixed as part of this patch too.  It's
obviously connected to the point about usage counts.

 * The adaptation logic it has makes it so slow to adapt that it takes
   several minutes to adapt.

Yeah.  I don't know if fixing that will fall naturally out of these
other changes or not, but I think it's a second-order concern in any
event.

 There's another thing we could do to noticeably improve scalability of
 buffer acquisition. Currently we do a huge amount of work under the
 freelist lock.
 In StrategyGetBuffer:
 LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 ...
 // check freelist, will usually be empty
 ...
 for (;;)
 {
     buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

     ++StrategyControl->nextVictimBuffer;

     LockBufHdr(buf);
     if (buf->refcount == 0)
     {
         if (buf->usage_count > 0)
         {
             buf->usage_count--;
         }
         else
         {
             /* Found a usable buffer */
             if (strategy != NULL)
                 AddBufferToRing(strategy, buf);
             return buf;
         }
     }
     UnlockBufHdr(buf);
 }

 So, we perform the entire clock sweep, until we find a single buffer we
 can use, inside a *global* lock. At times we need to iterate over the
 whole of shared_buffers BM_MAX_USAGE_COUNT (5) times till we have pushed
 down all the usage counts enough (if the database is busy it can take even
 longer...).
 In a busy database where usually all the usage counts are high, the next
 backend will touch a lot of those buffers again, which causes massive
 cache eviction & bouncing.

 It seems far more sensible to only protect the clock sweep's
 nextVictimBuffer with a spinlock. With some care all the rest can happen
 without any global interlock.

That's a lot more spinlock acquire/release cycles, but it might work
out to a win anyway.  Or it might lead to the system suffering a
horrible spinlock-induced death spiral on eviction-heavy workloads.

 I think even after fixing this - which we definitely should do - having
 a sensible/more aggressive bgwriter moving pages onto the freelist makes
 sense, because then backends don't need to deal with dirty pages.

Or scanning to find evictable pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Andres Freund
On 2013-06-27 09:50:32 -0400, Robert Haas wrote:
 On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund and...@2ndquadrant.com wrote:
  Contention-wise I agree. What I have seen is that we have a huge
  amount of cacheline bouncing around the buffer header spinlocks.
 
 How did you measure that?

perf record -e cache-misses. If you want more detail, looking at
{L1,LLC}-{load,store}{s,misses} can sometimes be helpful too.
Also, running perf stat -vvv postgres -D ... for a whole benchmark can
be useful to compare how much a change influences cache misses and such.

For very detailed analysis running something under valgrind/cachegrind
can be helpful too, but I usually find perf to be sufficient.

  I have previously added some adhoc instrumentation that printed the
  amount of buffers that were required (by other backends) during a
  bgwriter cycle and the amount of buffers that the buffer manager could
  actually write out.
 
 I think you can see how many are needed from buffers_alloc.  No?

Not easily correlated with bgwriter activity. If we cannot keep up
because it's 100% busy writing out buffers I don't have many problems
with that. But I don't think we often are.

  Problems with the current code:
 
  * doesn't manipulate the usage_count and never does anything to used
pages. Which means it will just about never find a victim buffer in a
busy database.
 
 Right.  I was thinking that was part of this patch, but it isn't.  I
 think we should definitely add that.  In other words, the background
 writer's job should be to run the clock sweep and add buffers to the
 free list.

We might need to split it into two for that: one process to write out
dirty pages, one to populate the freelist.
Otherwise we will probably regularly hit the current scalability issues
because we're currently I/O contended, say during a busy or even
immediate checkpoint.

  I think we should also split the lock: a spinlock for the
 freelist, and an lwlock for the clock sweep.

Yea, thought about that when writing the thing about the exclusive lock
during the clocksweep.

  * by far not aggressive enough, touches only a few buffers ahead of the
clock sweep.
 
 Check.  Fixing this might be a separate patch, but then again maybe
 not.  The changes we're talking about here provide a natural feedback
 mechanism: if we observe that the freelist is empty (or less than some
 length, like 32 buffers?) set the background writer's latch, because
 we know it's not keeping up.

Yes, that makes sense. Also provides adaptability to bursty workloads
which means we don't have too complex logic in the bgwriter for that.

  There's another thing we could do to noticeably improve scalability of
  buffer acquisition. Currently we do a huge amount of work under the
  freelist lock.
  ...
  So, we perform the entire clock sweep, until we find a single buffer we
  can use, inside a *global* lock. At times we need to iterate over the
  whole of shared_buffers BM_MAX_USAGE_COUNT (5) times till we have pushed
  down all the usage counts enough (if the database is busy it can take even
  longer...).
  In a busy database where usually all the usage counts are high, the next
  backend will touch a lot of those buffers again, which causes massive
  cache eviction & bouncing.
 
  It seems far more sensible to only protect the clock sweep's
  nextVictimBuffer with a spinlock. With some care all the rest can happen
  without any global interlock.
 
 That's a lot more spinlock acquire/release cycles, but it might work
 out to a win anyway.  Or it might lead to the system suffering a
 horrible spinlock-induced death spiral on eviction-heavy workloads.

I can't imagine it being worse than what we have today. Also, nobody
requires us to only advance the clock sweep by one page; we can easily do
it, say, 29 pages at a time or so if we detect the lock is contended.

Alternatively it shouldn't be too hard to make it into an atomic
increment, although that requires some trickery to handle the wraparound
sanely.
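
For illustration only, an atomic advance of the clock hand could look roughly
like the sketch below. It uses the pg_atomic_* API that PostgreSQL grew later;
at the time of this thread a compiler builtin such as __sync_fetch_and_add()
would play the same role, and the nextVictim field is an assumed
pg_atomic_uint32 living in BufferStrategyControl.

/*
 * Sketch only: the always-increasing counter is reduced modulo NBuffers
 * to get a buffer id.
 */
static int
ClockSweepTick(void)
{
    uint32      counter;

    counter = pg_atomic_fetch_add_u32(&StrategyControl->nextVictim, 1);

    /*
     * Real code would additionally keep the counter from wrapping at
     * UINT32_MAX mid-cycle and still maintain completePasses for the
     * bgwriter's statistics.
     */
    return (int) (counter % NBuffers);
}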

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:

 I don't think I actually found any workload where the bgwriter
 actually wrote out a relevant percentage of the necessary pages.

I had one at Wisconsin Courts.  The database which we targeted with
logical replication from the 72 circuit court databases (plus a few
others) on six database connection pool with about 20 to (at peaks)
hundreds of transactions per second modifying the database (the
average transaction involving about 20 modifying statements with
potentially hundreds of affected rows), with maybe 2000 to 3000
queries per second on a 30 connection pool, wrote about one-third
each of the dirty buffers with checkpoints, background writer, and
backends needing to read a page.  I shared my numbers with Greg,
who I believe used them as one of his examples for how to tune
memory, checkpoints, and background writer, so you might want to
check with him if you want more detail.

Of course, we set bgwriter_lru_maxpages = 1000 and
bgwriter_lru_multiplier = 4, and kept shared_buffers at 2GB to hit
that.  Without the reduced shared_buffers and more aggressive
bgwriter we hit the problem with writes overwhelming the RAID
controller's cache and causing everything in the database to
freeze until it cleared some cache space.

I'm not saying this invalidates your general argument; just that
such cases do exist.  Hopefully this data point is useful.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Amit Kapila
On Thursday, June 27, 2013 5:54 PM Robert Haas wrote:
 On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Configuration Details
  O/S - Suse-11
  RAM - 128GB
  Number of Cores - 16
  Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
  synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
  Pgbench - Select-only, Scale factor - 1200, Time - 30 mins

           8C-8T    16C-16T    32C-32T    64C-64T
  Head     62403     101810      99516      94707
  Patch    62827     101404      99109      94744
 
  On 128GB RAM, if we use scale factor 1200 (database approx. 17GB) and 14GB
  shared buffers, there is no major difference.
  One of the reasons could be that there is not much swapping in shared
  buffers, as most data already fits in shared buffers.
 
 I'd like to just back up a minute here and talk about the broader
 picture here.  What are we trying to accomplish with this patch?  Last
 year, I did some benchmarking on a big IBM POWER7 machine (16 cores,
 64 hardware threads).  Here are the results:
 
 http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-
 ibm.html
 
 Now, if you look at these results, you see something interesting.
 When there aren't too many concurrent connections, the higher scale
 factors are only modestly slower than the lower scale factors.  But as
 the number of connections increases, the performance continues to rise
 at the lower scale factors, and at the higher scale factors, this
 performance stops rising and in fact drops off.  So in other words,
 there's no huge *performance* problem for a working set larger than
 shared_buffers, but there is a huge *scalability* problem.  Now why is
 that?
 
 As far as I can tell, the answer is that we've got a scalability
 problem around BufFreelistLock.  Contention on the buffer mapping
 locks may also be a problem, but all of my previous benchmarking (with
 LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
 in the room.  My interest in having the background writer add buffers
 to the free list is basically around solving that problem.  It's a
 pretty dramatic problem, as the graph above shows, and this patch
 doesn't solve it.  There may be corner cases where this patch improves
 things (or, equally, makes them worse) but as a general point, the
 difficulty I've had reproducing your test results and the specificity
 of your instructions for reproducing them suggests to me that what we
 have here is not a clear improvement on general workloads.  Yet such
 an improvement should exist, because there are other products in the
 world that have scalable buffer managers; we currently don't.  Instead
 of spending a lot of time trying to figure out whether there's a small
 win in narrow cases here (and there may well be), I think we should
 back up and ask why this isn't a great big win, and what we'd need to
 do to *get* a great big win.  I don't see much point in tinkering
 around the edges here if things are broken in the middle; things that
 seem like small wins or losses now may turn out otherwise in the face
 of a more comprehensive solution.
 
 One thing that occurred to me while writing this note is that the
 background writer doesn't have any compelling reason to run on a
 read-only workload.  It will still run at a certain minimum rate, so
 that it cycles the buffer pool every 2 minutes, if I remember
 correctly.  But it won't run anywhere near fast enough to keep up with
 the buffer allocation demands of 8, or 32, or 64 sessions all reading
 data not all of which is in shared_buffers at top speed.  In fact,
 we've had reports that the background writer isn't too effective even
 on read-write workloads.  The point is - if the background writer
 isn't waking up and running frequently enough, what it does when it
 does wake up isn't going to matter very much.  I think we need to
 spend some energy poking at that.

Currently it wakes up based on the bgwriter_delay config parameter, which is by
default 200ms, so you mean we should think of waking up the bgwriter based on
allocations and the number of elements left in the freelist?

As per my understanding, a summarization of the points raised by you and Andres
which this patch should address to have a bigger win:

1. Bgwriter needs to be improved so that it can help in reducing usage count
and finding next victim buffer 
   (run the clock sweep and add buffers to the free list).
2. SetLatch for bgwriter (wakeup bgwriter) when elements in freelist are
less.
3. Split the global lock (BufFreelistLock) under which the work is done in
StrategyGetBuffer
   (a spinlock for the freelist, and an lwlock for the clock sweep).
4. Separate processes for writing dirty buffers and moving buffers to
freelist
5. Bgwriter needs to be more aggressive, logic based on which it calculates
how many buffers it needs to process needs to be improved.
6. There can be contention around buffer mapping locks, but we can focus on
it later
7. cacheline bouncing around the buffer header spinlocks, is there anything
we can do to reduce this?

Re: [HACKERS] Move unused buffers to freelist

2013-06-26 Thread Amit Kapila
On Tuesday, June 25, 2013 10:25 AM Amit Kapila wrote:
 On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
  On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila amit.kap...@huawei.com
  wrote:
   To avoid above 3 factors in test readings, I used below steps:
   1. Initialize the database with scale factor such that database size +
      shared_buffers = RAM (shared_buffers = 1/4 of RAM).
      For example:
      Example -1
         if RAM = 128G, then initialize db with scale factor = 6700
         and shared_buffers = 32GB.
         Database size (98 GB) + shared_buffers (32GB) = 130 (which
         is approximately equal to total RAM)
      Example -2 (this is based on your test m/c)
         If RAM = 64GB, then initialize db with scale factor = 3400
         and shared_buffers = 16GB.
   2. reboot m/c
   3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
      I had loaded 3 times, so that usage count of buffers will be
      approximately 3.
 
  Hmm.  I don't think the usage count will actually end up being 3,
  though, because the amount of data you're loading is sized to 3/4 of
  RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
  of pg_prewarm will end up turning over the entire cache and you'll
  never get any usage counts more than 1 this way.  Am I confused?
 
 The way I am pre-warming is by loading the data of each relation
 (table/index) three times in a row, so mostly the buffers will contain the
 data of the relations loaded last, which are the indexes, and those also get
 accessed more during scans. So the usage count should be 3.
 Could you please take a look at load_all_buffers.sql; maybe my understanding
 has some gap.

 Now, about the question of why to load all the relations:
 apart from PostgreSQL shared buffers, loading data this way also helps
 ensure the OS cache holds the data with a higher usage count, which can
 lead to better OS scheduling.
 
  I wonder if it would be beneficial to test the case where the
 database
  size is just a little more than shared_buffers.  I think that would
  lead to a situation where the usage counts are high most of the time,
  which - now that you mention it - seems like the sweet spot for this
  patch.
 
 I will check this case and take the readings for same. Thanks for your
 suggestions.

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
synchronous_commit = off, shared_buffers = 14GB, autovacuum = off
Pgbench - Select-only
Scale Factor - 1200
Time - 30 mins

            8C-8T     16C-16T   32C-32T   64C-64T
Head        62403     101810    99516     94707
Patch       62827     101404    99109     94744

On 128GB RAM, if we use scale factor = 1200 (database approx. 17GB) and 14GB shared
buffers, there is no major difference.
One of the reasons could be that there is not much churn in shared buffers,
as most of the data already fits in shared buffers.


I think more readings are needed for combinations based on the below setting:
scale factor such that database size + shared_buffers = RAM (shared_buffers
= 1/4 of RAM).

I can also try varying the shared_buffers size.

Kindly let me know your suggestions.

With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-06-24 Thread Robert Haas
On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila amit.kap...@huawei.com wrote:
 To avoid above 3 factors in test readings, I used below steps:
 1. Initialize the database with scale factor such that database size +
 shared_buffers = RAM (shared_buffers = 1/4 of RAM).
For example:
Example -1
 if RAM = 128G, then initialize db with scale factor = 6700
 and shared_buffers = 32GB.
 Database size (98 GB) + shared_buffers (32GB) = 130 (which
 is approximately equal to total RAM)
Example -2 (this is based on your test m/c)
 If RAM = 64GB, then initialize db with scale factor = 3400
 and shared_buffers = 16GB.
 2. reboot m/c
 3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
 I had loaded 3 times, so that usage count of buffers will be approximately
 3.

Hmm.  I don't think the usage count will actually end up being 3,
though, because the amount of data you're loading is sized to 3/4 of
RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
of pg_prewarm will end up turning over the entire cache and you'll
never get any usage counts more than 1 this way.  Am I confused?

I wonder if it would be beneficial to test the case where the database
size is just a little more than shared_buffers.  I think that would
lead to a situation where the usage counts are high most of the time,
which - now that you mention it - seems like the sweet spot for this
patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-06-24 Thread Amit Kapila
On Monday, June 24, 2013 11:00 PM Robert Haas wrote:
 On Thu, Jun 6, 2013 at 3:01 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  To avoid above 3 factors in test readings, I used below steps:
  1. Initialize the database with scale factor such that database size
 +
  shared_buffers = RAM (shared_buffers = 1/4 of RAM).
 For example:
 Example -1
  if RAM = 128G, then initialize db with scale factor =
 6700
  and shared_buffers = 32GB.
  Database size (98 GB) + shared_buffers (32GB) = 130
 (which
  is approximately equal to total RAM)
 Example -2 (this is based on your test m/c)
  If RAM = 64GB, then initialize db with scale factor =
 3400
  and shared_buffers = 16GB.
  2. reboot m/c
  3. Load all buffers with data (tables/indexes of pgbench) using
 pg_prewarm.
  I had loaded 3 times, so that usage count of buffers will be
 approximately
  3.
 
 Hmm.  I don't think the usage count will actually end up being 3,
 though, because the amount of data you're loading is sized to 3/4 of
 RAM, and shared_buffers is just 1/4 of RAM, so I think that each run
 of pg_prewarm will end up turning over the entire cache and you'll
 never get any usage counts more than 1 this way.  Am I confused?

The way I am pre-warming is by loading the data of each relation (table/index)
three times in a row, so mostly the buffers will contain the data of the
relations loaded last,
which are the indexes, and those also get accessed more during scans. So the usage
count should be 3.
Could you please take a look at load_all_buffers.sql; maybe my understanding has
some gap.

Now, about the question of why to load all the relations:
apart from PostgreSQL shared buffers, loading data this way also helps
ensure the OS cache holds the data with a higher usage count, which can
lead to better OS scheduling.

 I wonder if it would be beneficial to test the case where the database
 size is just a little more than shared_buffers.  I think that would
 lead to a situation where the usage counts are high most of the time,
 which - now that you mention it - seems like the sweet spot for this
 patch.

I will check this case and take the readings for same. Thanks for your
suggestions.

With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-06-06 Thread Amit Kapila
On Tuesday, May 28, 2013 6:54 PM Robert Haas wrote:
  Instead, I suggest modifying BgBufferSync, specifically this part
 right
  here:
 
  else if (buffer_state & BUF_REUSABLE)
  reusable_buffers++;
 
  What I would suggest is that if the BUF_REUSABLE flag is set here,
 use
  that as the trigger to do StrategyMoveBufferToFreeListEnd().
 
  I think at this point also we need to lock buffer header to check
 refcount
  and usage_count before moving to freelist, or do you think it is not
  required?
 
 If BUF_REUSABLE is set, that means we just did exactly what you're
 saying.  Why do it twice?

Even though we just did it, we have since released the buf header lock, so
theoretically there is a chance that a backend can increase the count; however,
that case is still protected by the check in StrategyGetBuffer(). As the chance
of it is very rare, doing this without the buffer header lock should not cause
any harm.
A modified patch addressing this is attached with this mail.
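
For reference, a hedged sketch (not the actual patch) of what that change could
look like inside BgBufferSync(), using the BUF_REUSABLE result from
SyncOneBuffer() as the trigger; the exact signature of
StrategyMoveBufferToFreeListEnd() is assumed here:

/* Hedged sketch of the idea above, inside the bgwriter's cleaning loop. */
int     buf_id = next_to_clean;        /* save before the loop advances it */
int     buffer_state = SyncOneBuffer(buf_id, true);

if (buffer_state & BUF_WRITTEN)
{
    reusable_buffers++;
    if (++num_written >= bgwriter_lru_maxpages)
        break;
}
else if (buffer_state & BUF_REUSABLE)
{
    /*
     * SyncOneBuffer() only reports BUF_REUSABLE after seeing refcount == 0
     * and usage_count == 0 under the buffer header lock, so no second
     * header-lock check is taken here; StrategyGetBuffer() still
     * re-verifies the buffer before handing it out.
     */
    reusable_buffers++;
    StrategyMoveBufferToFreeListEnd(&BufferDescriptors[buf_id]); /* assumed signature */
}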

Performance Data
---

As far as I have noticed, performance data for this patch depends on 3
factors:
1. Pre-loading of data into buffers, so that buffers holding pages have
some usage count before running pgbench.
   The reason is that this can make a difference to the performance of the clock sweep.
2. Clearing of pages from the OS cache before running pgbench with each
patch; otherwise a run with or without the
patch
   can access pages already cached due to previous runs, which causes
variation in performance.
3. Scale factor and shared_buffers configuration

To avoid above 3 factors in test readings, I used below steps:
1. Initialize the database with scale factor such that database size +
shared_buffers = RAM (shared_buffers = 1/4 of RAM).
   For example: 
   Example -1
if RAM = 128G, then initialize db with scale factor = 6700
and shared_buffers = 32GB.
Database size (98 GB) + shared_buffers (32GB) = 130 (which
is approximately equal to total RAM)
   Example -2 (this is based on your test m/c)
If RAM = 64GB, then initialize db with scale factor = 3400
and shared_buffers = 16GB.
2. reboot m/c
3. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I had loaded 3 times, so that usage count of buffers will be approximately
3.
   Used file load_all_buffers.sql attached with this mail
4. run 3 times pgbench select-only case for 10 or 15 minutes without patch
5. reboot m/c
6. Load all buffers with data (tables/indexes of pgbench) using pg_prewarm.
I had loaded 3 times, so that usage count of buffers will be approximately
3.
   Used file load_all_buffers.sql attached with this mail
7. run 3 times pgbench select-only case for 10 or 15 minutes with patch

Using above steps, I had taken performance data on 2 different m/c's

Configuration Details
O/S - Suse-11
RAM - 128GB
Number of Cores - 16
Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
synchronous_commit = off, shared_buffers = 32GB, autovacuum = off
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 15 mins

Below data is for average of 3 runs

            16C-16T   32C-32T   64C-64T
HEAD        4391      3971      3464
After Patch 6147      5093      3944

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm

Below data is for 1 run of half hour on same configuration

            16C-16T   32C-32T   64C-64T
HEAD        4377      3861      3295
After Patch 6542      4770      3504


Configuration Details
O/S - Suse-11
RAM - 24GB
Number of Cores - 8
Server Conf - checkpoint_segments = 256; checkpoint_timeout = 25 min,
synchronous_commit = off, shared_buffers = 5GB
Pgbench - Select-only
Scalefactor - 1200
Time - Each run is of 10 mins

Below data is for average 3 runs of 10 minutes

            8C-8T     16C-16T   32C-32T   64C-64T   128C-128T   256C-256T
HEAD        58837     56740     19390     5681      3191        2160
After Patch 59482     56936     25070     7655      4166        2704

Detailed data of each run is attached with mail in file
move_unused_buffers_to_freelist_v2.htm


Below data is for 1 run of half hour on same configuration

            32C-32T
HEAD        17703
After Patch 20586

I have run these tests multiple times to ensure correctness. I think the
reason it didn't show a performance improvement in your runs last time is
that the way we are each running pgbench is different. This time, I have
detailed the steps I used to collect the performance data.


With Regards,
Amit Kapila.


move_unused_buffers_to_freelist_v2.patch
Description: Binary data


Re: [HACKERS] Move unused buffers to freelist

2013-05-28 Thread Robert Haas
 Instead, I suggest modifying BgBufferSync, specifically this part right
 here:

 else if (buffer_state & BUF_REUSABLE)
 reusable_buffers++;

 What I would suggest is that if the BUF_REUSABLE flag is set here, use
 that as the trigger to do StrategyMoveBufferToFreeListEnd().

 I think at this point also we need to lock buffer header to check refcount
 and usage_count before moving to freelist, or do you think it is not
 required?

If BUF_REUSABLE is set, that means we just did exactly what you're
saying.  Why do it twice?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Move unused buffers to freelist

2013-05-25 Thread Amit kapila
On Friday, May 24, 2013 8:22 PM Jim Nasby wrote:
On 5/14/13 8:42 AM, Amit Kapila wrote:
 In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
 and refcount = 0) buffers to the end of the freelist. I have implemented a new API
 StrategyMoveBufferToFreeListEnd() to move buffers to the end of the freelist.


 Instead of a separate function, would it be better to add an argument to 
 StrategyFreeBuffer? 

  Yes, it could be done with a parameter deciding whether to put the
buffer at the head or the tail of the freelist.
  However, the main focus currently is to check in which cases this optimization
can give a benefit.
  Robert had run tests for quite a number of cases where it doesn't show any
significant gain.
  I am also trying various configurations to see if it gives any benefit.
  Robert has given some suggestions to change the way the new function is
currently being called;
  I will try that and update with the results.

  I am not very sure that the default pgbench workload is a good scenario for
testing this optimization.
  If you have any suggestions for tests where it can show a benefit, that would
be great input.

 ISTM this is similar to the other strategy stuff in the buffer manager, so 
 perhaps it should mirror that...

With Regards,
Amit Kapila.



Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Amit Kapila
On Thursday, May 23, 2013 8:45 PM Robert Haas wrote:
 On Tue, May 21, 2013 at 3:06 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Here are the results.  The first field in each line is the number of
  clients. The second number is the scale factor.  The numbers after
  master and patched are the median of three runs.
 
 but overall, on both the read-only and
  read-write tests, I'm not seeing anything that resembles the big
 gains
  you reported.
 
  I have not generated numbers for read-write tests, I will check that
 once.
  For read-only tests, the performance increase is minor and different
 from
  what I saw.
  Few points which I could think of for difference in data:
 
  1. In my test's I always observed best data when number of
 clients/threads
  are equal to number of cores which in your case should be at 16.
 
 Sure, but you also showed substantial performance increases across a
 variety of connection counts, whereas I'm seeing basically no change
 at any connection count.
  2. I think for scale factor 100 and 300, there should not be much
  performance increase, as for them they should mostly get buffer from
  freelist inspite of even bgwriter adds to freelist or not.
 
 I agree.
 
  3. In my tests variance is for shared buffers, database size is
 always less
  than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB),
 but due
  to variance in shared buffers, it can lead to I/O.
 
 Not sure I understand this.

What I wanted to say is that all your tests were on the same shared_buffers
configuration (8GB), whereas in my tests I was varying shared_buffers as well.
However, this is not an important point, as the patch should show a performance
gain on the configuration you ran if there is any real benefit from it.
 
  4. Each run is of 20 minutes, not sure if this has any difference.
 
 I've found that 5-minute tests are normally adequate to identify
 performance changes on the pgbench SELECT-only workload.
 
  Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
 the
  PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel
 3.2.6.
 
  To think about the difference in your and my runs, could you please
 tell me
  about below points
  1. What is RAM in machine.
 
 64GB
 
  2. Are number of threads equal to number of clients.
 
 Yes.
 
  3. Before starting tests I have always done pre-warming of buffers
 (used
  pg_prewarm written by you last year), is it same for above read-only
 tests.
 
 No, I did not use pg_prewarm.  But I don't think that should matter
 very much.  First, the data was all in the OS cache.  Second, on the
 small scale factors, everything should end up in cache pretty quickly
 anyway.  And on the large scale factors, well, you're going to be
 churning shared_buffers anyway, so pg_prewarm is only going to affect
 the very beginning of the test.
 
  4. Can you please once again run only the test where you saw
 variation(8
  clients @ scale factor 1000 on master), because I have also seen
 that
  performance difference is very good for certain
 configurations(Scale Factor, RAM, Shared Buffers)
 
 I can do this if I get a chance, but I don't really see where that's
 going to get us.  It seems pretty clear to me that there's no benefit
 on these tests from this patch.  So either one of us is doing the
 benchmarking incorrectly, or there's some difference in our test
 environments that is significant, but none of the proposals you've
 made so far seem to me to explain the difference.

Sorry for requesting you to run again without any concrete point.
After reading the data you posted more carefully, I realized that the outlier
reading was probably just some machine problem or something else, and that there
is actually no gain.
After your post, I tried various configurations on different machines, but so
far I am not able to see the performance gain that was shown in my initial mail.
In fact, I tried on the same machine as well, and it sometimes gives good data.
I will update you if I find any concrete reason and results.

  Apart from above, I had one more observation during my investigation
 to find
  why in some cases, there is a small dip:
  1. Many times, it finds the buffer in free list is not usable, means
 it's
  refcount or usage count is not zero, due to which it had to spend
 more time
  under BufFreelistLock.
 I had not any further experiments related to this finding like if
 it
  really adds any overhead.
 
  Currently I am trying to find reasons for small dip of performance
 and see
  if I could do something to avoid it. Also I will run tests with
 various
  configurations.
 
  Any other suggestions?
 
 Well, I think that the code in SyncOneBuffer is not really optimal.
 In some cases you actually lock and unlock the buffer header an extra
 time, which seems like a whole lotta extra overhead.  In fact, I don't
 think you should be modifying SyncOneBuffer() at all, because that
 affects not only the background writer but also checkpoints.
 Presumably it is not right to put every unused 

Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Amit Kapila
On Friday, May 24, 2013 2:47 AM Jim Nasby wrote:
 On 5/14/13 2:13 PM, Greg Smith wrote:
  It is possible that we are told to put something in the freelist that
  is already in it; don't screw up the list if so.
 
  I don't see where the code does anything to handle that though.  What
 was your intention here?
 
 IIRC, the code that pulls from the freelist already deals with the
 possibility that a block was on the freelist but has since been put to
 use. 

You are right, the check exists in StrategyGetBuffer()

If that's the case then there shouldn't be much penalty to adding
 a block multiple times (at least within reason...)

There is a check in StrategyFreeBuffer() which will not allow a buffer to be
put on the list multiple times;
I had just used the same check in the new function.

With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-05-24 Thread Jim Nasby

On 5/14/13 8:42 AM, Amit Kapila wrote:

In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
and refcount = 0) buffers to the end of the freelist. I have implemented a new API
StrategyMoveBufferToFreeListEnd() to move buffers to the end of the freelist.



Instead of a separate function, would it be better to add an argument to 
StrategyFreeBuffer? ISTM this is similar to the other strategy stuff in the 
buffer manager, so perhaps it should mirror that...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Robert Haas
On Tue, May 21, 2013 at 3:06 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Here are the results.  The first field in each line is the number of
 clients. The second number is the scale factor.  The numbers after
 master and patched are the median of three runs.

 01 100 master 1433.297699 patched 1420.306088
 01 300 master 1371.286876 patched 1368.910732
 01 1000 master 1056.891901 patched 1067.341658
 01 3000 master 637.312651 patched 685.205011
 08 100 master 10575.017704 patched 11456.043638
 08 300 master 9262.601107 patched 9120.925071
 08 1000 master 1721.807658 patched 1800.733257
 08 3000 master 819.694049 patched 854.333830
 32 100 master 26981.677368 patched 27024.507600
 32 300 master 14554.870871 patched 14778.285400
 32 1000 master 1941.733251 patched 1990.248137
 32 3000 master 846.654654 patched 892.554222

 Is the above test for tpc-b?
 In the above tests, there is performance increase from 1~8% and decrease
 from 0.2~1.5%

It's just the default pgbench workload.

 And here's the same results for 5-minute, read-only tests:

 01 100 master 9361.073952 patched 9049.553997
 01 300 master 8640.235680 patched 8646.590739
 01 1000 master 8339.364026 patched 8342.799468
 01 3000 master 7968.428287 patched 7882.121547
 08 100 master 71311.491773 patched 71812.899492
 08 300 master 69238.839225 patched 70063.632081
 08 1000 master 34794.778567 patched 65998.468775
 08 3000 master 60834.509571 patched 61165.998080
 32 100 master 203168.264456 patched 205258.283852
 32 300 master 199137.276025 patched 200391.633074
 32 1000 master 177996.853496 patched 176365.732087
 32 3000 master 149891.147442 patched 148683.269107

 Something appears to have screwed up my results for 8 clients @ scale
 factor 300 on master,

   Do you want to say the reading of 1000 scale factor?

Yes.

but overall, on both the read-only and
 read-write tests, I'm not seeing anything that resembles the big gains
 you reported.

 I have not generated numbers for read-write tests, I will check that once.
 For read-only tests, the performance increase is minor and different from
 what I saw.
 Few points which I could think of for difference in data:

 1. In my test's I always observed best data when number of clients/threads
 are equal to number of cores which in your case should be at 16.

Sure, but you also showed substantial performance increases across a
variety of connection counts, whereas I'm seeing basically no change
at any connection count.

 2. I think for scale factor 100 and 300, there should not be much
 performance increase, as for them they should mostly get buffer from
 freelist inspite of even bgwriter adds to freelist or not.

I agree.

 3. In my tests variance is for shared buffers, database size is always less
 than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB), but due
 to variance in shared buffers, it can lead to I/O.

Not sure I understand this.

 4. Each run is of 20 minutes, not sure if this has any difference.

I've found that 5-minute tests are normally adequate to identify
performance changes on the pgbench SELECT-only workload.

 Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
 PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

 To think about the difference in your and my runs, could you please tell me
 about below points
 1. What is RAM in machine.

64GB

 2. Are number of threads equal to number of clients.

Yes.

 3. Before starting tests I have always done pre-warming of buffers (used
 pg_prewarm written by you last year), is it same for above read-only tests.

No, I did not use pg_prewarm.  But I don't think that should matter
very much.  First, the data was all in the OS cache.  Second, on the
small scale factors, everything should end up in cache pretty quickly
anyway.  And on the large scale factors, well, you're going to be
churning shared_buffers anyway, so pg_prewarm is only going to affect
the very beginning of the test.

 4. Can you please once again run only the test where you saw variation(8
 clients @ scale factor 1000 on master), because I have also seen that
 performance difference is very good for certain
configurations(Scale Factor, RAM, Shared Buffers)

I can do this if I get a chance, but I don't really see where that's
going to get us.  It seems pretty clear to me that there's no benefit
on these tests from this patch.  So either one of us is doing the
benchmarking incorrectly, or there's some difference in our test
environments that is significant, but none of the proposals you've
made so far seem to me to explain the difference.

 Apart from above, I had one more observation during my investigation to find
 why in some cases, there is a small dip:
 1. Many times, it finds the buffer in free list is not usable, means it's
 refcount or usage count is not zero, due to which it had to spend more time
 under BufFreelistLock.
I had not any further experiments related to this finding like if it
 really adds any 

Re: [HACKERS] Move unused buffers to freelist

2013-05-23 Thread Jim Nasby

On 5/14/13 2:13 PM, Greg Smith wrote:

It is possible that we are told to put something in the freelist that
is already in it; don't screw up the list if so.

I don't see where the code does anything to handle that though.  What was your 
intention here?


IIRC, the code that pulls from the freelist already deals with the possibility 
that a block was on the freelist but has since been put to use. If that's the 
case then there shouldn't be much penalty to adding a block multiple times (at 
least within reason...)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Move unused buffers to freelist

2013-05-21 Thread Amit Kapila
On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
 On Thu, May 16, 2013 at 10:18 AM, Amit Kapila amit.kap...@huawei.com
 wrote:
  Further Performance Data:
 
  Below data is for average 3 runs of 20 minutes
 
  Scale Factor   - 1200
  Shared Buffers - 7G
 
 These results are good but I don't get similar results in my own
 testing.  

Thanks for running detailed tests

 I ran pgbench tests at a variety of client counts and scale
 factors, using 30-minute test runs and the following non-default
 configuration parameters.
 
 shared_buffers = 8GB
 maintenance_work_mem = 1GB
 synchronous_commit = off
 checkpoint_segments = 300
 checkpoint_timeout = 15min
 checkpoint_completion_target = 0.9
 log_line_prefix = '%t [%p] '
 
 Here are the results.  The first field in each line is the number of
 clients. The second number is the scale factor.  The numbers after
 master and patched are the median of three runs.
 
 01 100 master 1433.297699 patched 1420.306088
 01 300 master 1371.286876 patched 1368.910732
 01 1000 master 1056.891901 patched 1067.341658
 01 3000 master 637.312651 patched 685.205011
 08 100 master 10575.017704 patched 11456.043638
 08 300 master 9262.601107 patched 9120.925071
 08 1000 master 1721.807658 patched 1800.733257
 08 3000 master 819.694049 patched 854.333830
 32 100 master 26981.677368 patched 27024.507600
 32 300 master 14554.870871 patched 14778.285400
 32 1000 master 1941.733251 patched 1990.248137
 32 3000 master 846.654654 patched 892.554222


Is the above test for TPC-B?
In the above tests, there is a performance increase of 1~8% and a decrease
of 0.2~1.5%.

 And here's the same results for 5-minute, read-only tests:
 
 01 100 master 9361.073952 patched 9049.553997
 01 300 master 8640.235680 patched 8646.590739
 01 1000 master 8339.364026 patched 8342.799468
 01 3000 master 7968.428287 patched 7882.121547
 08 100 master 71311.491773 patched 71812.899492
 08 300 master 69238.839225 patched 70063.632081
 08 1000 master 34794.778567 patched 65998.468775
 08 3000 master 60834.509571 patched 61165.998080
 32 100 master 203168.264456 patched 205258.283852
 32 300 master 199137.276025 patched 200391.633074
 32 1000 master 177996.853496 patched 176365.732087
 32 3000 master 149891.147442 patched 148683.269107
 
 Something appears to have screwed up my results for 8 clients @ scale
 factor 300 on master, 

   Do you mean the reading for scale factor 1000?
  
but overall, on both the read-only and
 read-write tests, I'm not seeing anything that resembles the big gains
 you reported.

I have not generated numbers for read-write tests; I will check that.
For read-only tests, the performance increase is minor and different from
what I saw.
A few points I could think of for the difference in data:

1. In my tests I always observed the best data when the number of clients/threads
is equal to the number of cores, which in your case should be 16.
2. I think for scale factors 100 and 300 there should not be much
performance increase, as in those cases backends should mostly get a buffer from the
freelist regardless of whether the bgwriter adds to the freelist or not.
3. In my tests the variance is in shared buffers; the database size is always less
than RAM (scale factor 1200, approx db size 16~17GB, RAM 24GB), but due
to the variance in shared buffers it can lead to I/O.
4. Each run is of 20 minutes; I am not sure if this makes any difference.
 
 Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
 PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

To think about the difference between your runs and mine, could you please tell me
about the below points:
1. How much RAM is in the machine?
2. Is the number of threads equal to the number of clients?
3. Before starting tests I have always done pre-warming of buffers (using
pg_prewarm, written by you last year); is it the same for the above read-only tests?
4. Could you please run once again only the test where you saw the variation (8
clients @ scale factor 1000 on master), because I have also seen that
the performance difference is very good for certain
   configurations (scale factor, RAM, shared buffers)?

Apart from the above, I had one more observation during my investigation into
why there is a small dip in some cases:
1. Many times, the buffer found in the freelist is not usable, meaning its
refcount or usage count is not zero, due to which more time is spent
under BufFreelistLock.
   I have not done any further experiments related to this finding, such as
whether it really adds any overhead.
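
For reference, a simplified rendering of the 9.3-era freelist path in
StrategyGetBuffer() that is behind this observation: stale entries are popped
and re-checked while BufFreelistLock is still held, so every unusable entry
costs extra time under that lock. This is paraphrased from the existing code,
not from the patch.

/* Simplified from the freelist path of StrategyGetBuffer(). */
volatile BufferDesc *buf;

while (StrategyControl->firstFreeBuffer >= 0)
{
    buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];

    /* Unconditionally remove buffer from freelist */
    StrategyControl->firstFreeBuffer = buf->freeNext;
    buf->freeNext = FREENEXT_NOT_IN_LIST;

    /*
     * If the buffer is pinned or has a nonzero usage_count, we cannot use
     * it; discard it and retry -- all while BufFreelistLock is still held,
     * which is the overhead described above.
     */
    LockBufHdr(buf);
    if (buf->refcount == 0 && buf->usage_count == 0)
        return buf;             /* returned with its header lock still held */
    UnlockBufHdr(buf);
}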

Currently I am trying to find the reasons for the small dip in performance and see
if I can do something to avoid it. I will also run tests with various
configurations.

Any other suggestions?

With Regards,
Amit Kapila.





Re: [HACKERS] Move unused buffers to freelist

2013-05-21 Thread Amit Kapila
On Tuesday, May 21, 2013 12:36 PM Amit Kapila wrote:
 On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
  On Thu, May 16, 2013 at 10:18 AM, Amit Kapila
 amit.kap...@huawei.com
  wrote:
   Further Performance Data:
  
   Below data is for average 3 runs of 20 minutes
  
   Scale Factor   - 1200
   Shared Buffers - 7G
 
  These results are good but I don't get similar results in my own
  testing.
 
 Thanks for running detailed tests
 
  I ran pgbench tests at a variety of client counts and scale
  factors, using 30-minute test runs and the following non-default
  configuration parameters.
 
  shared_buffers = 8GB
  maintenance_work_mem = 1GB
  synchronous_commit = off
  checkpoint_segments = 300
  checkpoint_timeout = 15min
  checkpoint_completion_target = 0.9
  log_line_prefix = '%t [%p] '
 
  Here are the results.  The first field in each line is the number of
  clients. The second number is the scale factor.  The numbers after
  master and patched are the median of three runs.
 
  01 100 master 1433.297699 patched 1420.306088
  01 300 master 1371.286876 patched 1368.910732
  01 1000 master 1056.891901 patched 1067.341658
  01 3000 master 637.312651 patched 685.205011
  08 100 master 10575.017704 patched 11456.043638
  08 300 master 9262.601107 patched 9120.925071
  08 1000 master 1721.807658 patched 1800.733257
  08 3000 master 819.694049 patched 854.333830
  32 100 master 26981.677368 patched 27024.507600
  32 300 master 14554.870871 patched 14778.285400
  32 1000 master 1941.733251 patched 1990.248137
  32 3000 master 846.654654 patched 892.554222
 
 
 Is the above test for tpc-b?
 In the above tests, there is performance increase from 1~8% and
 decrease
 from 0.2~1.5%
 
  And here's the same results for 5-minute, read-only tests:
 
  01 100 master 9361.073952 patched 9049.553997
  01 300 master 8640.235680 patched 8646.590739
  01 1000 master 8339.364026 patched 8342.799468
  01 3000 master 7968.428287 patched 7882.121547
  08 100 master 71311.491773 patched 71812.899492
  08 300 master 69238.839225 patched 70063.632081
  08 1000 master 34794.778567 patched 65998.468775
  08 3000 master 60834.509571 patched 61165.998080
  32 100 master 203168.264456 patched 205258.283852
  32 300 master 199137.276025 patched 200391.633074
  32 1000 master 177996.853496 patched 176365.732087
  32 3000 master 149891.147442 patched 148683.269107
 
  Something appears to have screwed up my results for 8 clients @ scale
  factor 300 on master,
 
   Do you want to say the reading of 1000 scale factor?
 
 but overall, on both the read-only and
  read-write tests, I'm not seeing anything that resembles the big
 gains
  you reported.
 
 I have not generated numbers for read-write tests, I will check that
 once.
 For read-only tests, the performance increase is minor and different
 from
 what I saw.
 Few points which I could think of for difference in data:
 
 1. In my test's I always observed best data when number of
 clients/threads
 are equal to number of cores which in your case should be at 16.
 2. I think for scale factor 100 and 300, there should not be much
 performance increase, as for them they should mostly get buffer from
 freelist inspite of even bgwriter adds to freelist or not.
 3. In my tests variance is for shared buffers, database size is always
 less
 than RAM (Scale Factor -1200, approx db size 16~17GB, RAM -24 GB), but
 due
 to variance in shared buffers, it can lead to I/O.
 4. Each run is of 20 minutes, not sure if this has any difference.
 
  Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
 the
  PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
 
 To think about the difference in your and my runs, could you please
 tell me
 about below points
 1. What is RAM in machine.
 2. Are number of threads equal to number of clients.
 3. Before starting tests I have always done pre-warming of buffers
 (used
 pg_prewarm written by you last year), is it same for above read-only
 tests.
 4. Can you please once again run only the test where you saw
 variation(8
 clients @ scale factor 1000 on master), because I have also seen that
 performance difference is very good for certain
configurations(Scale Factor, RAM, Shared Buffers)

Looking more closely at the data you posted, I believe there is some
problem with the reading for 8
clients @ scale factor 1000 on master, as in all other cases the data for
scale factor 1000 is better than for 3000, except this one.
So I think there is no need to run it again.

 Apart from above, I had one more observation during my investigation to
 find
 why in some cases, there is a small dip:
 1. Many times, it finds the buffer in free list is not usable, means
 it's
 refcount or usage count is not zero, due to which it had to spend more
 time
 under BufFreelistLock.
I had not any further experiments related to this finding like if it
 really adds any overhead.
 
 Currently I am trying to find reasons for small dip of performance and
 see
 if I could do 

Re: [HACKERS] Move unused buffers to freelist

2013-05-20 Thread Robert Haas
On Thu, May 16, 2013 at 10:18 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Further Performance Data:

 Below data is for average 3 runs of 20 minutes

 Scale Factor   - 1200
 Shared Buffers - 7G

These results are good but I don't get similar results in my own
testing.  I ran pgbench tests at a variety of client counts and scale
factors, using 30-minute test runs and the following non-default
configuration parameters.

shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
log_line_prefix = '%t [%p] '

Here are the results.  The first field in each line is the number of
clients.  The second number is the scale factor.  The numbers after
master and patched are the median of three runs.

01 100 master 1433.297699 patched 1420.306088
01 300 master 1371.286876 patched 1368.910732
01 1000 master 1056.891901 patched 1067.341658
01 3000 master 637.312651 patched 685.205011
08 100 master 10575.017704 patched 11456.043638
08 300 master 9262.601107 patched 9120.925071
08 1000 master 1721.807658 patched 1800.733257
08 3000 master 819.694049 patched 854.333830
32 100 master 26981.677368 patched 27024.507600
32 300 master 14554.870871 patched 14778.285400
32 1000 master 1941.733251 patched 1990.248137
32 3000 master 846.654654 patched 892.554222

And here's the same results for 5-minute, read-only tests:

01 100 master 9361.073952 patched 9049.553997
01 300 master 8640.235680 patched 8646.590739
01 1000 master 8339.364026 patched 8342.799468
01 3000 master 7968.428287 patched 7882.121547
08 100 master 71311.491773 patched 71812.899492
08 300 master 69238.839225 patched 70063.632081
08 1000 master 34794.778567 patched 65998.468775
08 3000 master 60834.509571 patched 61165.998080
32 100 master 203168.264456 patched 205258.283852
32 300 master 199137.276025 patched 200391.633074
32 1000 master 177996.853496 patched 176365.732087
32 3000 master 149891.147442 patched 148683.269107

Something appears to have screwed up my results for 8 clients @ scale
factor 300 on master, but overall, on both the read-only and
read-write tests, I'm not seeing anything that resembles the big gains
you reported.

Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Move unused buffers to freelist

2013-05-14 Thread Amit Kapila
As discussed and concluded in the mail thread
(http://www.postgresql.org/message-id/006f01ce34f0$d6fa8220$84ef8660$@kapila
@huawei.com) about moving unused buffers to the end of the freelist,

I have implemented the idea and taken some performance data.

 

 

In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
and refcount = 0) buffers to the end of the freelist. I have implemented a new API
StrategyMoveBufferToFreeListEnd() to move buffers to the end of the freelist.

 

Performance Data :

 

Configuration Details

O/S - Suse-11

RAM - 24GB

Number of Cores - 8

Server Conf - checkpoint_segments = 256; checkpoint_timeout = 25 min,
synchronous_commit = off, shared_buffers = 5GB

Pgbench - Select-only

Scalefactor - 1200

Time - Each run is of 20 mins

 

Below data is for average 3 runs of 20 minutes

 

            8C-8T     16C-16T   32C-32T   64C-64T
HEAD        11997     8455      4989      2757
After Patch 19807     13296     8388      2821

 

Detailed data for each run is attached with this mail.

 

This is just the initial data; I will collect more data based on different
shared_buffers settings and other configurations.

 

Feedback/Suggestions?

 

With Regards,

Amit Kapila.


Detailed run data (from the attachment):

Configuration Used
  O/S                  : Suse-11
  RAM                  : 24GB
  Number of Cores      : 8
  Server Conf          : checkpoint_segments = 256, checkpoint_timeout = 25 min,
                         synchronous_commit = off
  shared_buffers       : 5GB
  pgbench              : select-only
  pgbench Scale Factor : 1200

HEAD
  Runs/Concurrency   8C-8T     16C-16T   32C-32T   64C-64T
  Run-1              9664      7425      5106      2570
  Run-2              13081     9009      4874      2713
  Run-3              13246     8932      4988      2990
  Average            11997     8455      4989      2757

After Patch
  Runs/Concurrency   8C-8T     16C-16T   32C-32T   64C-64T
  Run-1              16239     11539     7395      2512
  Run-2              21408     15090     8923      2870
  Run-3              21776     13261     8846      3083
  Average            19807     13296     8388      2821

  Diff In %          65.09     57.25     68.12     2.3

move_unsed_buffers_to_freelist.patch
Description: Binary data



Re: [HACKERS] Move unused buffers to freelist

2013-05-14 Thread Greg Smith

On 5/14/13 9:42 AM, Amit Kapila wrote:

In the attached patch, bgwriter/checkpointer moves unused (usage_count = 0
and refcount = 0) buffers to the end of the freelist. I have implemented a
new API StrategyMoveBufferToFreeListEnd() to


There's a comment in the new function:

It is possible that we are told to put something in the freelist that
is already in it; don't screw up the list if so.

I don't see where the code does anything to handle that though.  What 
was your intention here?


This area has always been the tricky part of the change.  If you do 
something complicated when adding new entries, like scanning the 
freelist for duplicates, you run the risk of holding BufFreelistLock for 
too long.  To try and see that in benchmarks, I would use a small 
database scale (I typically use 100 for this type of test) and a large 
number of clients.  -M prepared would help get a higher transaction 
rate out of the hardware too.  It might take a server with a large core 
count to notice any issues with holding the lock for too long though.


Instead you might just invalidate buffers before they go onto the list. 
 Doing that will then throw away usefully cached data though.


To try and optimize both insertion speed and retaining cached data, I 
was thinking about using a hash table for the free buffers, instead of 
the simple linked list approach used in the code now.


Also: check the formatting of the additions in bufmgr.c; I noticed a
spaces vs. tabs difference on lines 35/36 of your patch.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Move unused buffers to freelist

2013-05-14 Thread Amit Kapila
On Wednesday, May 15, 2013 12:44 AM Greg Smith wrote:
 On 5/14/13 9:42 AM, Amit Kapila wrote:
  In the attached patch, bgwriter/checkpointer moves unused
 (usage_count
  =0  refcount = 0) buffer's to end of freelist. I have implemented a
  new API StrategyMoveBufferToFreeListEnd() to
 
 There's a comment in the new function:
 
 It is possible that we are told to put something in the freelist that
 is already in it; don't screw up the list if so.
 
 I don't see where the code does anything to handle that though.  What
 was your intention here?

The intention is to put the entry in the freelist only if it is not already in the
freelist, which is accomplished by the check
if (buf->freeNext == FREENEXT_NOT_IN_LIST). Whenever an entry is removed from the
freelist, buf->freeNext is marked as FREENEXT_NOT_IN_LIST.
Code Reference (last line):
StrategyGetBuffer()
{
..
..
while (StrategyControl->firstFreeBuffer >= 0)
{
    buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
    Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

    /* Unconditionally remove buffer from freelist */
    StrategyControl->firstFreeBuffer = buf->freeNext;
    buf->freeNext = FREENEXT_NOT_IN_LIST;

    ...
}

Also the same check exists in StrategyFreeBuffer().
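
For completeness, here is a hedged reconstruction of what the new
StrategyMoveBufferToFreeListEnd() could look like with that same guard. It is
modeled on the existing StrategyFreeBuffer() (which pushes at the head of the
list), not taken from the patch; firstFreeBuffer, lastFreeBuffer and
FREENEXT_END_OF_LIST are the existing freelist fields and constant.

/* Hedged reconstruction, not the patch: append an unused buffer at the tail. */
void
StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
{
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    /*
     * It is possible that we are told to put something in the freelist
     * that is already in it; don't screw up the list if so.
     */
    if (buf->freeNext == FREENEXT_NOT_IN_LIST)
    {
        buf->freeNext = FREENEXT_END_OF_LIST;
        if (StrategyControl->firstFreeBuffer < 0)
            StrategyControl->firstFreeBuffer = buf->buf_id;
        else
            BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
        StrategyControl->lastFreeBuffer = buf->buf_id;
    }

    LWLockRelease(BufFreelistLock);
}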

 This area has always been the tricky part of the change.  If you do
 something complicated when adding new entries, like scanning the
 freelist for duplicates, you run the risk of holding BufFreelistLock
 for
 too long. 

Yes, this is true, and I have tried to hold this lock for a minimal time.
In this patch, it holds BufFreelistLock only to put the unused buffer at the end
of the freelist.

 To try and see that in benchmarks, I would use a small
 database scale (I typically use 100 for this type of test) and a large
 number of clients.  

-M prepared would help get a higher transaction
 rate out of the hardware too.  It might take a server with a large core
 count to notice any issues with holding the lock for too long though.

This is a good idea; I shall take another set of readings with -M prepared.
 
 Instead you might just invalidate buffers before they go onto the list.
   Doing that will then throw away usefully cached data though.

Yes, if we invalidate buffers, it might throw away usefully cached data,
especially when the working set is just a tiny bit smaller than shared_buffers.
This was pointed out by Robert in his mail
http://www.postgresql.org/message-id/CA+TgmoYhWsz__KtSxm6BuBirE7VR6Qqc_COkbE
ztqpk8oom...@mail.gmail.com


 To try and optimize both insertion speed and retaining cached data,

I think the method proposed by the patch takes care of both, because it
directly puts the free buffer at the end of the freelist, and
because it doesn't invalidate the buffers it can retain cached data for a
longer period.
Do you see any flaw in the current approach?

 I
 was thinking about using a hash table for the free buffers, instead of
 the simple linked list approach used in the code now.

Okay, we can try different methods for maintaining free buffers if we find
the current approach doesn't turn out to be good.
 
 Also:  check the formatting on the additions to in bufmgr.c, I noticed
 a
 spaces vs. tabs difference on lines 35/36 of your patch.

Thanks for pointing it out; I shall send an updated patch along with the next
set of performance data.


With Regards,
Amit Kapila.


