Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-28 Thread Greg Smith

On Tue, 27 Mar 2007, Magnus Hagander wrote:


Would not at least some of these numbers be better presented through the
stats collector, so they can be easily monitored?
That goes along the line of my "way way way away from finished" attempt
earlier, perhaps a combination of these two patches?


When I saw your patch recently, I thought to myself "hmmm, the data 
collected here sure looks familiar"--you even made some of the exact same 
code changes I did.  I've been bogged down recently chasing a performance 
issue that, come to find out, was mainly caused by the "high CPU usage for 
stats collector" bug.  That caused the background writer to slow to a 
crawl under heavy load, which is why I was having all these checkpoint and 
writer issues that got me monitoring that code in the first place.


With that seemingly resolved, here's the slightly revised plan.  Next I 
want to take the data I've been collecting in my patch, bundle the most 
important parts of it into messages sent to the stats collector the way it 
was suggested you rewrite your patch, then submit the result.  I've got 
the log files down and have a pretty good idea what data should be 
collected, but as this would be my first time adding stats I'd certainly 
love some help with that.


Once that monitoring infrastructure is in place, I plan to merge 
Itagaki's "Load distributed checkpoint" patch (it touches a lot of the 
same code) and test it out under heavy load.  It gives a much better 
context for evaluating that patch if, rather than measuring just its 
gross results, you can say something like "with the patch in place, the 
average fsync time on my system dropped from 3 seconds to 1.2 seconds 
when writing out more than 100MB at checkpoint time."  That fsync stall 
is the direct cause of the biggest problem in that area of code, so why 
not stare right at it rather than measuring it indirectly?


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-27 Thread Bruce Momjian

Is this patch ready?

---

Greg Smith wrote:
 I have a WIP patch that adds the main detail I have found I need to 
 properly tune checkpoint and background writer activity.  I think it's 
 almost ready to submit (you can see the current patch against 8.2 at 
 http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
 after making it a bit more human-readable.  But I've realized that along 
 with that, I need some guidance in regards to what log level is 
 appropriate for this information.
 
 An example works better than explaining what the patch does:
 
 2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote 
 checkpoint_segments)
 2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
 2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
 2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
 2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
 2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034 
 write=279956 us sync=17739974 us
 
 Remember that "Load distributed checkpoint" discussion back in December? 
 You can see exactly how bad the problem is on your system with this log 
 style (this is from a pgbench run where it's positively awful--it really 
 does take over 17 seconds for the fsync to execute, and there are clients 
 that are hung the whole time waiting for it).
 
 I also instrumented the background writer.  You get messages like this:
 
 2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5 
 Unscanned = 23/54
 
 This shows that we wrote (5) the maximum pages we were allowed to write 
 (5) while failing to scan almost half (23) of the buffers we meant to look 
 at (54).  By taking a look at this output while the system is under load, 
 I found I was able to do bgwriter optimization that used to take me days 
 of frustrating testing in hours.  I've been waiting for a good guide to 
 bgwriter tuning since 8.1 came out.  Once you have this, combined with 
 knowing how many buffers were dirty at checkpoint time because the 
 bgwriter didn't get to them in time (the number you want to minimize), the 
 tuning guide practically writes itself.
 
 So my question is...what log level should all this go at?  Right now, I 
 have the background writer stuff adjusting its level dynamically based on 
 what happened; it logs at DEBUG2 if it hits the write limit (which should 
 be a rare event once you're tuned properly), DEBUG3 for writes that 
 scanned everything they were supposed to, and DEBUG4 if it scanned but 
 didn't find anything to write.  The source of checkpoint information logs 
 at DEBUG1, the fsync/write info at DEBUG2.
 
 I'd like to move some of these up.  On my system, I even have many of the 
 checkpoint messages logged at INFO (the source of the checkpoint and the 
 total write time line).  It's a bit chatty, but when you get some weird 
 system pause issue it makes it easy to figure out if checkpoints were to 
 blame.  Given how useful I feel some of these messages are to system 
 tuning, and to explaining what currently appears as inexplicable pauses, I 
 don't want them to be buried at DEBUG levels where people are unlikely to 
 ever see them (I think some people may be concerned about turning on 
 things labeled DEBUG at all).  I am aware that I am too deep into this to 
 have an unbiased opinion at this point though, which is why I ask for 
 feedback on how to proceed here.
 
 --
 * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
 

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-27 Thread Magnus Hagander
Would not at least some of these numbers be better presented through the
stats collector, so they can be easily monitored?

That goes along the line of my way way way away from finished attempt
earlier, perhaps a combination of these two patches?

//Magnus


Bruce Momjian wrote:
 Is this patch ready?
 
 ---
 
 Greg Smith wrote:
 [...]




Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-27 Thread Bruce Momjian
Magnus Hagander wrote:
 Would not at least some of these numbers be better presented through the
 stats collector, so they can be easily monitored?
 
 That goes along the line of my way way way away from finished attempt
 earlier, perhaps a combination of these two patches?

Yes.

---


 
 //Magnus
 
 
 Bruce Momjian wrote:
  Is this patch ready?
  
  ---
  
  Greg Smith wrote:
  [...]

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-13 Thread Greg Smith

On Mon, 12 Mar 2007, Tom Lane wrote:


It might also be interesting to know exactly how many buffers were
pinned at the time the scan passed over them.  In theory it should be a
small fraction, but maybe it isn't ...


It is; the theory held for all the tests I tried today.  The actual 
pinned buffers were so few (typically a fraction of the number of clients) 
that I reverted to just lumping them in with the recently used ones.  To 
better reflect the vast majority of what it's interacting with, in my 
patch I renamed the SyncOneBuffer "skip_pinned" parameter to 
"skip_recently_used".  It seems natural that something currently pinned 
would also be considered recently used; I didn't find the current naming 
so obvious.


I'm also now collecting clean vs. dirty usage histogram counts, as you 
suggested.  Nothing exciting to report there so far; I may note 
something interesting after I collect more data.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-12 Thread Greg Smith

On Fri, 9 Mar 2007, Tom Lane wrote:

It strikes me that the patch would be more useful if it produced a 
histogram of the observed usage_counts


Don't have something worth releasing yet, but I did code a first rev of 
this today.  The results are quite instructive and it's well worth looking 
at.


The main server I work on has shared_buffers=6 for pgbench testing and 
the background writers turned way up (in hopes of reducing checkpoint 
times; that's a whole other topic).  I run pgbench with s=100 (~1.6GB 
database).  Here's what I get as statistics on the buffer pool after a 
scan when the server is happy, from a run with 20 clients:


writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram:
0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Basically, everything that's left dirty is also heavily used; as I noted 
before, when I inspect with pg_buffercache these are mostly index blocks. 
I note that I should probably generate a second histogram that compares 
the clean data.  The "all" scan is writing pages out as soon as they're 
dirtied, so the LRU background writer is lucky if it can find anything to 
do.  (Note that I don't rely on the LRU writer more because that causes a 
significant lull in writes after a checkpoint.  By the time it gets going 
again it's harder to catch up, and delaying PostgreSQL writes also 
aggravates issues with the way Linux caches writes.)


What I found really interesting was comparing a sad section where 
performance breaks down.  This is from a minute later:


writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram:
0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage 
counts.  The breakdown seems to involve evicting the index blocks to make 
room for the recently dirty data when it can't be written out fast enough. 
As the index blocks go poof, things slow further, and into the vicious 
circle you go.  Eventually you can end up blocked on a combination of 
buffer evictions and disk seeks for uncached data that are fighting with 
the writes.
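As an aside for readers following along, a dirty-buffer usage-count histogram like the ones shown above could be computed from pg_buffercache-style data. This is only an illustrative sketch: the function name and input shape are invented, and real numbers would come from the view's usagecount and isdirty columns.

```python
from collections import Counter

def dirty_usage_histogram(buffers):
    """Histogram of usage counts over dirty buffers only, bucketing
    counts of 5 and above together, like the '5+=' term in the logs.

    buffers: iterable of (usage_count, is_dirty) pairs, e.g. pulled
    from a pg_buffercache-style view (hypothetical input format)."""
    dirty = [min(u, 5) for u, d in buffers if d]
    hist = Counter(dirty)
    total = len(dirty) or 1    # avoid division by zero on a clean pool
    # Return the fraction of dirty buffers in each usage-count bucket.
    return {bucket: hist.get(bucket, 0) / total for bucket in range(6)}
```

The returned fractions for buckets 0 through 5 correspond directly to the "0=... 1=... 5+=..." fields in the log lines.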


The bgwriter change this suggested to me is defining a "triage" behavior 
where the "all" scan switches to acting like the LRU one:


- Each sample period, note what % of the usage_count=0 records are dirty.
- If that number is above a tunable threshold, switch to only writing 
usage_count=0 records until it isn't anymore.


On my system, a first guess for that tunable would be 2-5%, based on the 
values I see on either side of the sad periods.  No doubt some systems 
would set it much higher; I haven't tested my system at home yet to get a 
guideline for a more typical PC.
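The triage rule sketched above can be modeled in a few lines. This is an illustrative sketch only, not server code (the real bgwriter is C inside PostgreSQL); the function and parameter names are invented, and the default threshold is just the 2-5% first guess from the post.

```python
def choose_scan_mode(sampled_buffers, dirty_threshold=0.05):
    """Pick the bgwriter 'all' scan behavior for the next period.

    sampled_buffers: list of (usage_count, is_dirty) pairs observed
    this sample period (hypothetical input format).
    dirty_threshold: tunable fraction of dirty usage_count=0 buffers
    above which we fall back to LRU-style triage.
    """
    zero_usage = [(u, d) for u, d in sampled_buffers if u == 0]
    if not zero_usage:
        return "all"            # nothing evictable yet; keep the normal scan
    dirty_fraction = sum(d for _, d in zero_usage) / len(zero_usage)
    # Too many soon-to-be-recycled buffers are dirty: write only
    # usage_count=0 buffers until the fraction drops back down.
    return "triage" if dirty_fraction > dirty_threshold else "all"
```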


As for why this behavior matters so much to me, I actually have a 
prototype auto-tuning background writer design that was hung up on this 
particular problem.  It notes how often you write out max_pages, uses that 
to model the average percentage of the buffer cache you're traversing each 
pass, then runs a couple of weighted-average feedback loops to aim for a 
target seconds/sweep.


The problem was that it went berserk when the whole buffer cache was dirty 
(I hope someone appreciates that I've been calling this "Felix Unger on 
crack" mode in my notes).  I needed some way to prioritize which buffers 
to concentrate on when that happens, and so far the above has been a good 
first-cut way to help with that.
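To make the feedback-loop idea concrete, here is a toy version of one adjustment step. It is not the actual prototype: the names, the blending constant, and the single-loop structure (the post mentions a couple of loops) are all assumptions for illustration.

```python
def adjust_pages_per_round(pages_per_round, observed_sweep_secs,
                           target_sweep_secs, alpha=0.3):
    """One step of a weighted-average feedback loop aiming for a target
    seconds-per-sweep of the buffer pool: if the last sweep ran longer
    than the target, scan proportionally more pages each round.

    alpha controls how much weight the newest observation gets."""
    error_ratio = observed_sweep_secs / target_sweep_secs
    # Exponentially weighted blend of the old rate and the corrected one.
    new_rate = pages_per_round * ((1 - alpha) + alpha * error_ratio)
    return max(1.0, new_rate)   # never stop scanning entirely
```

Repeated over sweeps, the rate converges so that a full pass over the pool takes roughly the target time, given a stable write load.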


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-12 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 Here's what I get as statistics on the buffer pool after a 
 scan when the server is happy, from a run with 20 clients:

 writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
 dirty buffer usage count histogram:
 0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Interesting, but not real meaningful by itself --- what's the situation
for non-dirty buffers?  And how many are there of each?

It might also be interesting to know exactly how many buffers were
pinned at the time the scan passed over them.  In theory it should be a
small fraction, but maybe it isn't ...

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

"Pinned" means bufHdr->refcount > 0, and you don't distinguish pinned from 
recently-used (bufHdr->usage_count > 0) buffers in your patch.


Thank you, I will revise the terminology accordingly.  I was using 
"pinned" as a shortcut for "will be ignored by skip_pinned", which was 
sloppy of me.  As I said, I was trying to show how the buffer cache looks 
from the perspective of the background writer, and therefore lumping them 
together because that's how SyncOneBuffer views them.  A buffer cache full 
of either type will be largely ignored by the LRU writer, and that's what 
I've been finding when running insert/update-heavy workloads like pgbench.


If I might suggest a terminology change to avoid this confusion in the 
future, I'd like to rename the SyncOneBuffer "skip_pinned" parameter to 
something like "skip_active", which is closer to the real behavior.  I 
know Oracle refers to these as "hot" and "cold" LRU entries.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Jim Nasby

On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
almost everything that's dirty is also pinned during pgbench, and  
the LRU is lucky to find anything it can write as a result


I'm wondering if pgbench is a good test of this stuff. ISTM it's 
unrealistically write-heavy, which is going to tend to not only put a 
lot of dirty buffers into the pool, but also keep them pinned enough 
that you can't write them.


Perhaps you should either modify pgbench to do a lot more selects 
out of the various tables or look towards a different benchmark.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)







Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Jim Nasby

On Mar 9, 2007, at 7:57 AM, Greg Smith wrote:

[...]

If I might suggest a terminology change to avoid this confusion in 
the future, I'd like to rename the SyncOneBuffer "skip_pinned" 
parameter to something like "skip_active", which is closer to the 
real behavior.  I know Oracle refers to these as "hot" and "cold" 
LRU entries.


Well, AIUI, whether the buffer is actually pinned or not is almost 
inconsequential (other than that if a buffer *is* pinned, its usage 
count is about to become > 0, so there's no reason to consider 
writing it).


What that parameter really does is control whether you're going to 
follow the LRU semantics or not...

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes:
 On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
 almost everything that's dirty is also pinned during pgbench, and  
 the LRU is lucky to find anything it can write as a result

 I'm wondering if pgbench is a good test of this stuff.

On reflection I think that Greg's result is probably unsurprising, and
furthermore does not indicate that anything is wrong.

What it shows (now that we got past the terminology) is that only about
half of the buffer pool is subject to replacement during any given clock
sweep.  For low-usage pages that's about what you'd expect: a page is
sucked in on demand (using a buffer returned by the clock sweep), and
when we're done with it it'll have usage_count = 1.  If it's not touched
again then when the clock sweep returns to it it'll be decremented to
usage_count 0, and on the next visit it'll be recycled for use as
something else.  Thus for low-usage pages you'd fully expect that about
half of the buffer population has usage_count 1 and the rest has usage
count 0; which is strikingly close to Greg's measurement that 48.8%
of the population has usage_count 0.
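Tom's steady-state argument can be checked with a toy simulation of the clock sweep described above. This is a deliberate simplification (one-touch pages only, no pins, no hot pages; buffer and allocation counts are arbitrary), not a model of the real server.

```python
def simulate_clock_sweep(n_buffers=1000, n_allocations=20000):
    """Simulate buffer replacement for a pure low-usage workload: every
    allocation claims a victim via the clock sweep, and each page is
    touched exactly once (usage_count = 1) after being read in.
    Returns the time-averaged fraction of buffers at usage_count 0."""
    counts = [1] * n_buffers       # every resident page was touched once
    hand = 0
    samples = []
    for i in range(n_allocations):
        # Clock sweep: decrement usage counts until a 0-count victim appears.
        while counts[hand] > 0:
            counts[hand] -= 1
            hand = (hand + 1) % n_buffers
        counts[hand] = 1           # recycled buffer, used once on demand
        hand = (hand + 1) % n_buffers
        if i % 97 == 0:            # sample the pool state periodically
            samples.append(counts.count(0) / n_buffers)
    return sum(samples) / len(samples)
```

Running it yields a fraction near one half, consistent with the ~48.8% measurement discussed in the thread.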

What this seems to tell us is that pgbench's footprint of heavily used
pages (those able to achieve usage_counts above 1) is very small.  Which
is probably right unless you've used a very large scale factor.  I'd be
interested to know what scale factor and shared_buffers setting led to
the above measurement.

It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, Jim Nasby wrote:

I'm wondering if pgbench is a good test of this stuff. ISTM it's 
unrealistically write-heavy, which is going to tend to not only put a 
lot of dirty buffers into the pool, but also keep them pinned enough 
that you can't write them.


Whether it's unrealistically write-heavy kind of depends on what your 
real app is.  The standard pgbench is a bit weird because it does so many 
updates to tiny tables, which adds a level of locking contention that 
doesn't really reflect many real-world situations.  But the no-branch mode 
(update/select to accounts, insert into history) isn't too dissimilar from 
some insert-heavy logging applications I've seen.


The main reason I brought this all up was because Itagaki seemed to be 
using pgbench for some of his performance tests.  I just wanted to point 
out that the LRU background writer specifically tends to be very 
underutilized when using pgbench.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, Tom Lane wrote:

I'd be interested to know what scale factor and shared_buffers setting 
led to the above measurement.


That was just a trivial example with 1 client, scale=10 (~160MB database), 
and shared_buffers=24MB.  Where things really get interesting with pgbench 
is on a system with enough horsepower+clients to dirty the whole buffer 
cache well before a checkpoint.  I regularly see 75% of the cache dirty 
and blocked from LRU writes with pgbench's slightly pathological workload 
in that situation.


You're correct that these results aren't particularly surprising or 
indicative of a problem to be corrected.  But they do shed some light on 
what pgbench is and isn't appropriate for testing.



It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.


I'll start working in that direction.  With the feedback everyone has 
given me on how few of the buffers are truly "pinned" under the correct 
usage of the term, I'm going to revisit the usage details and revise that 
section.  I'm happy with how I'm reporting the checkpoint details now; 
there's still some work left to do on the bgwriter activity.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro
Jim Nasby [EMAIL PROTECTED] wrote:

  Also, my recommended bgwriter_lru_maxpages is the average number of
  recycled buffers per cycle, which is hard to tune manually.
 
 What do you mean by 'number of recycled buffers per cycle'?

There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
| To reduce the probability that server processes will need to issue their
| own writes, the background writer tries to write buffers that are likely
| to be recycled soon.
| * bgwriter_lru_maxpages (integer)
| In each round, no more than this many buffers will be written as a
| result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of
StrategyGetBuffer() calls per round. I think that number is suitable for
bgwriter_lru_maxpages if we want the bgwriter to write almost all of the
dirty pages.

I proposed changing the semantics of bgwriter_lru_maxpages. It used to
represent the maximum writes per round, but the new meaning is the number
of buffers reserved for recycling. Non-dirty unused buffers are counted
toward it.
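One way to read this suggestion is as a moving average of per-round recycling demand. Here is a hypothetical sketch of such an estimator; in the server this would instead count StrategyGetBuffer() calls per bgwriter round, and the class and weighting constant are invented for illustration.

```python
class LruMaxPagesEstimator:
    """Tracks the average number of buffers recycled per bgwriter round,
    as a candidate automatic value for bgwriter_lru_maxpages."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha        # weight given to the newest round
        self.estimate = 0.0

    def observe_round(self, buffers_recycled):
        # Exponentially weighted moving average of per-round recycling.
        self.estimate += self.alpha * (buffers_recycled - self.estimate)
        return self.estimate
```

Under a steady workload the estimate converges to the true per-round recycling rate while smoothing out bursts.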


I'm measuring the difference of performance between manual and automatic
tuning, especially adding some jobs before writes. I'll inform you about
them when I get the results.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro

Greg Smith [EMAIL PROTECTED] wrote:

  Also, my recommended bgwriter_lru_maxpages is average number of
  recycled buffers per cycle, that is hardly able to tune manually.
 
 This is completely dependent on what percentage of your buffer cache is 
 pinned.

Don't you mean usage_count? In my understanding, each backend pins only
two or so buffers at once, so the percentage of pinned buffers should be low.

 If your load is something like the standard pgbench, the LRU 
 writer will rarely find anything useful to write, so this entire line of 
 thinking won't work.  The proper behavior for heavily pinned data is to 
 turn off the LRU writer altogether so there's more time to run the "all" 
 scan.

I think you are pointing out a problem in the buffer management algorithm
itself, not just in the bgwriter. Even if you turn off the LRU writer, each
backend pays the same cost to find recyclable buffers every time it
allocates a buffer.
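That per-allocation cost can be seen in a minimal clock-sweep sketch. This is
illustrative Python, not PostgreSQL code; the field names mirror the buffer
header but everything else is an assumption made for illustration.

```python
# Minimal clock-sweep sketch: every buffer allocation scans forward,
# decrementing usage_count, until it finds a buffer that is neither
# pinned (refcount > 0) nor recently used (usage_count > 0).
def clock_sweep(buffers, hand):
    """Return (victim_index, buffers_scanned) for one allocation."""
    scanned = 0
    while True:
        buf = buffers[hand]
        scanned += 1
        if buf["refcount"] == 0:
            if buf["usage_count"] == 0:
                return hand, scanned      # found a recyclable victim
            buf["usage_count"] -= 1       # give it a second chance
        hand = (hand + 1) % len(buffers)

pool = [{"refcount": 0, "usage_count": 1},
        {"refcount": 1, "usage_count": 0},   # pinned: always skipped
        {"refcount": 0, "usage_count": 0}]
victim, cost = clock_sweep(pool, 0)
```

The `cost` grows with the number of pinned or hot buffers ahead of the sweep
hand, and each backend pays it on every allocation whether or not the LRU
writer is running.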

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread Greg Smith

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

In my understanding, each backend pins two or so buffers at once, so the
percentage of pinned buffers should be low.


With the pgbench workload, a substantial percentage of the buffer cache 
ends up pinned.  From staring at the buffer cache using 
contrib/pg_buffercache, I believe most of that consists of the index 
blocks for the records being updated in the accounts table.


I just posted a new version of the patch I asked for feedback on at the 
beginning of this thread, the latest one is at 
http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
I've been adjusting it to monitor the same data I think you need to refine 
your patch.  I believe the approach you're taking includes some 
assumptions that seem perfectly reasonable, but that my testing doesn't 
agree with.  There's nothing like measuring something to settle what's 
really going on, though, so that's what I've been focusing on.  I'd love 
to get some feedback on whether other people can replicate the kind of 
things I'm seeing.


The new code generates statistics about exactly what the background writer 
scan found during each round.  If there's been substantial write activity, 
it prints a log line when it recycles back to the beginning of the all 
scan, to help characterize what the buffer pool looked like during that 
scan from the perspective of the bgwriter.  Here's some sample log output 
from my underpowered laptop while running pgbench:


bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB 
(31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache 
was both pinned and dirty.  During that same time period, the LRU write 
also wrote out a fair amount of data, operating on the 20% of the cache 
that was dirty but not pinned.  On my production server, where the 
background writer is turned way up to reduce checkpoint times, these 
numbers are even more extreme; almost everything that's dirty is also 
pinned during pgbench, and the LRU is lucky to find anything it can write 
as a result.
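A monitoring script can turn those log lines into numbers to graph. The
following is a hedged example: the line format comes from a work-in-progress
patch, so treat the regex as illustrative rather than a stable interface.

```python
import re

# Parse the experimental "bgwriter scan all" log line into
# {field: (megabytes, percent_of_buffer_cache)} pairs.
SCAN_RE = re.compile(r"(\w+)=([\d.]+) MB \(([\d.]+)%\)")

def parse_scan_line(line):
    """Return {field: (megabytes, percent_of_buffer_cache)}."""
    return {name: (float(mb), float(pct))
            for name, mb, pct in SCAN_RE.findall(line)}

line = ("bgwriter scan all writes=16.6 MB (69.3%) "
        "pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)")
stats = parse_scan_line(line)
```

Tracking the `pinned` percentage over time is what reveals whether the LRU
writer has anything useful to do under a given workload.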


That patch is against the 8.2 codebase; now that I'm almost done, I'm
planning to move it to HEAD soon (where it will conflict considerably
with your patch).  If you have an 8.2 configuration, you can
test with my patch applied, set log_min_messages = debug2, try it out, and 
see what you get when running pgbench for a while.  I think you'll 
discover some interesting and unexpected things.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 With the pgbench workload, a substantial percentage of the buffer cache 
 ends up pinned.

[ raised eyebrow... ]  Prove that.  AFAIK it's impossible for the
pgbench queries to pin more than about three or four buffers per backend
concurrently.

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-08 Thread ITAGAKI Takahiro

Greg Smith [EMAIL PROTECTED] wrote:

  In my understanding, each backend pins two or so buffers at once. So 
  percentage of pinned buffers should be low.
 
 With the pgbench workload, a substantial percentage of the buffer cache 
 ends up pinned.

 http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt 
 bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB 
 (31.9%)
 ...
 checkpoint required (wrote checkpoint_segments)
 checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms
 
 Here 69% of the buffer cache contained dirty data, and 49% of the cache 
 was both pinned and dirty.

No. Pinned means bufHdr->refcount > 0, and your patch does not distinguish
pinned from recently-used (bufHdr->usage_count > 0) buffers.

!   if (bufHdr->refcount != 0 || bufHdr->usage_count != 0)
    {
!       if (skip_pinned)
!       {
!           UnlockBufHdr(bufHdr);
!           return BUF_PINNED;
!       }
!       buffer_write_type = BUF_WRITTEN_AND_PINNED;
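The distinction being drawn can be summarized in a short sketch. This is
illustrative Python, not PostgreSQL code; the labels are made up for
illustration, and only the refcount/usage_count semantics come from the
discussion above.

```python
# "Pinned" strictly means refcount > 0 (a backend holds the buffer right
# now); usage_count > 0 only means the clock sweep saw recent use.
def classify(refcount, usage_count):
    if refcount > 0:
        return "pinned"            # cannot be evicted at all right now
    if usage_count > 0:
        return "recently-used"     # evictable once the sweep decrements it
    return "recyclable"            # immediate victim candidate
```

Conflating the first two categories, as the quoted patch excerpt does,
overstates how much of the cache is truly pinned.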


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-07 Thread Jim Nasby

On Mar 6, 2007, at 10:11 PM, ITAGAKI Takahiro wrote:

I have some results that if we have plenty of time for checkpoints,
bgwriter_all_maxpages is not a so important parameter because it is
adjusted to shared_buffers / duration of checkpoint.
Also, my recommended bgwriter_lru_maxpages is the average number of
recycled buffers per cycle, which is hard to tune manually.


What do you mean by 'number of recycled buffers per cycle'?
--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-07 Thread Greg Smith

On Wed, 7 Mar 2007, ITAGAKI Takahiro wrote:


Also, my recommended bgwriter_lru_maxpages is the average number of
recycled buffers per cycle, which is hard to tune manually.


This is completely dependent on what percentage of your buffer cache is 
pinned.  If your load is something like the standard pgbench, the LRU 
writer will rarely find anything useful to write, so this entire line of 
thinking won't work.  The proper behavior for heavily pinned data is to 
turn off the LRU writer altogether so there's more time to run the all 
scan.


The job I'm trying to take on here is not to presume I can solve these
problems myself yet.  I've instead recognized that people need usefully
organized information in order to even move in that direction, and that
information is not even close to being available right now.


What my latest work-in-progress patches do is summarize each scan of the
buffer pool with information about how much was written by each of the two
writers, along with what percentage of the pool was pinned data.
I'm trying to get that one ready to submit this week.  Those three values
suggest some powerful techniques for tuning, but they're not quite enough
to allow auto-tuning; that also needs a feel for how much time is left
before the next checkpoint.


What really needs to go along with all this is a sort of progress bar that
estimates how far we are from the next checkpoint based on both a) the
timeout, and b) how many segments have been written.  The timeout one is
easy to work with that way (from what I read of your code, you've worked
that angle).  The part I had trouble with was getting the WAL writers to
communicate back to the bgwriter a progress report on how many segments
they have filled.
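The progress-bar idea can be sketched as follows. This is a hedged
illustration only: the next checkpoint fires on whichever trigger comes
first, so estimated progress is the maximum of the time fraction and the
WAL-segment fraction. The function and parameter names are assumptions, not
actual GUCs or PostgreSQL interfaces.

```python
# Estimated progress toward the next checkpoint, which triggers either
# when checkpoint_timeout elapses or when checkpoint_segments WAL
# segments have been filled, whichever comes first.
def checkpoint_progress(elapsed_s, timeout_s, segs_written, checkpoint_segments):
    """Fraction [0, 1] of the way to the next checkpoint."""
    by_time = elapsed_s / timeout_s
    by_wal = segs_written / checkpoint_segments
    return min(1.0, max(by_time, by_wal))

# Halfway through the timeout, but two of three segments already filled:
# WAL is the binding trigger here.
progress = checkpoint_progress(150, 300, 2, 3)
```

The hard part, as noted above, is the `segs_written` input: it has to be
reported back from the processes filling WAL segments, not computed locally
in the bgwriter.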


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread Jim Nasby

On Mar 5, 2007, at 8:34 PM, Greg Smith wrote:

On Thu, 22 Feb 2007, Jim C. Nasby wrote:

It would also be extremely useful to make checkpoint stats visible  
somewhere in the database (presumably via the existing stats  
mechanism)... I'm thinking just tracking how many pages had to be  
flushed during a checkpoint would be a good start.


I'm in the middle of testing an updated version of the patch; once
I nail down exactly what needs to be logged, I plan to open a
discussion on which of those things would be best served by
pg_stats instead of a log.


I decided specifically to aim for the logs instead for the
checkpoint data, because if you're in a situation where you are
inserting fast enough that the checkpoints are spaced closely
together, you'd end up having to poll pg_stats all the time to
make sure you catch them all, which becomes even less efficient
than just logging the data.


It's always good to be able to log stuff for detailed  
troubleshooting, so that's a good place to start. The flipside is  
that it's much easier to machine-parse a table rather than trying to  
scrape the logs. And I don't think we'll generally care about each  
individual checkpoint; rather we'll want to look at things like  
'checkpoints/hour' and 'checkpoint written pages/hour'.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread Greg Smith

On Tue, 6 Mar 2007, Jim Nasby wrote:

The flipside is that it's much easier to machine-parse a table rather 
than trying to scrape the logs.


Now you might realize why I've been so vocal on the SQL log export 
implementation details.


And I don't think we'll generally care about each individual checkpoint; 
rather we'll want to look at things like 'checkpoints/hour' and 
'checkpoint written pages/hour'.


After a few months of staring at this data, I've found averages like that 
misleading.  The real problem areas correlate with the peak pages written 
at any one checkpoint.  Lowering that value is really the end-game for 
optimizing the background writer, and the peaks are what will nail you 
with a nasty fsync pause at checkpoint time.
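The averaging problem is easy to show with a worked example. The numbers
below are invented purely for illustration.

```python
# Why an hourly average smooths away exactly the checkpoint that
# causes the fsync pause: dirty pages written at four checkpoints.
pages_per_checkpoint = [400, 380, 9200, 410]
mean = sum(pages_per_checkpoint) / len(pages_per_checkpoint)
peak = max(pages_per_checkpoint)
# The mean (~2600 pages) looks tame; the 9200-page peak is what hurts.
```

A 'written pages/hour' metric would report something close to the mean,
while the latency spike users complain about is driven entirely by the peak.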


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-06 Thread ITAGAKI Takahiro
Greg Smith [EMAIL PROTECTED] wrote:

 After a few months of staring at this data, I've found averages like that 
 misleading.  The real problem areas correlate with the peak pages written 
 at any one checkpoint.  Lowering that value is really the end-game for 
 optimizing the background writer, and the peaks are what will nail you 
 with a nasty fsync pause at checkpoint time.

If you already have the technical knowledge to derive the best settings from
activity logs, could you write it down in the code? I'd like to see some
kind of automatic control in the bgwriter if its best configuration varies
with usage and activity.


BTW, I'm planning two changes in bgwriter.

  [Load distributed checkpoint]
http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php
  [Automatic adjustment of bgwriter_lru_maxpages]
http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php

I have some results showing that if we have plenty of time for checkpoints,
bgwriter_all_maxpages is not such an important parameter, because it
adjusts to shared_buffers / checkpoint duration.
Also, my recommended bgwriter_lru_maxpages is the average number of
recycled buffers per cycle, which is hard to tune manually.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-05 Thread Greg Smith

On Thu, 22 Feb 2007, Jim C. Nasby wrote:

It would also be extremely useful to make checkpoint stats visible 
somewhere in the database (presumably via the existing stats 
mechanism)... I'm thinking just tracking how many pages had to be 
flushed during a checkpoint would be a good start.


I'm in the middle of testing an updated version of the patch; once I nail
down exactly what needs to be logged, I plan to open a discussion on which
of those things would be best served by pg_stats instead of a log.


I decided specifically to aim for the logs for the checkpoint data because
if you're in a situation where you are inserting fast enough that the
checkpoints are spaced closely together, you'd end up having to poll
pg_stats all the time to make sure you catch them all, which becomes even
less efficient than just logging the data.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-05 Thread Greg Smith

On Wed, 21 Feb 2007, Robert Treat wrote:


My impression of this is that DBAs would typically want to run this for a
short period of time to get their systems tuned, and then it pretty much
becomes chatter.  Can you come up with an idea of what information DBAs need
to know?


I am structuring the log levels so that you can see all the events that are
likely sources of extreme latency by using log level DEBUG1.  That level
isn't too chatty, and I've decided I can leave the server there forever if
need be.  I've actually come up with a much better interface to the
background writer logging than the one I originally outlined, which I'll
have ready for feedback in a few more days.  Now it dumps a nice summary
report every time it completes a scan, rather than logging lower-level
info.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-02-22 Thread Jim C. Nasby
On Mon, Feb 19, 2007 at 10:59:38PM -0500, Greg Smith wrote:
 I have a WIP patch that adds the main detail I have found I need to 
 properly tune checkpoint and background writer activity.  I think it's 
 almost ready to submit (you can see the current patch against 8.2 at 
 http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) 
 after making it a bit more human-readable.  But I've realized that along 
 with that, I need some guidance in regards to what log level is 
 appropriate for this information.
It would also be extremely useful to make checkpoint stats visible
somewhere in the database (presumably via the existing stats mechanism).
The log output is great for doing initial tuning, but you'd want to also
be able to monitor things, which would be impractical via logging. I'm
thinking just tracking how many pages had to be flushed during a
checkpoint would be a good start.
-- 
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-02-21 Thread Robert Treat
On Monday 19 February 2007 22:59, Greg Smith wrote:
 I have a WIP patch that adds the main detail I have found I need to
 properly tune checkpoint and background writer activity.  I think it's
 almost ready to submit (you can see the current patch against 8.2 at
 http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
 after making it a bit more human-readable.  But I've realized that along
 with that, I need some guidance in regards to what log level is
 appropriate for this information.

 An example works better than explaining what the patch does:

 2007-02-19 21:53:24.602 EST - DEBUG:  checkpoint required (wrote
 checkpoint_segments)
 2007-02-19 21:53:24.685 EST - DEBUG:  checkpoint starting
 2007-02-19 21:53:24.705 EST - DEBUG:  checkpoint flushing buffer pool
 2007-02-19 21:53:24.985 EST - DEBUG:  checkpoint database fsync starting
 2007-02-19 21:53:42.725 EST - DEBUG:  checkpoint database fsync complete
 2007-02-19 21:53:42.726 EST - DEBUG:  checkpoint buffer flush dirty=8034
 write=279956 us sync=17739974 us

 Remember that Load distributed checkpoint discussion back in December?
 You can see exactly how bad the problem is on your system with this log
 style (this is from a pgbench run where it's postively awful--it really
 does take over 17 seconds for the fsync to execute, and there are clients
 that are hung the whole time waiting for it).

 I also instrumented the background writer.  You get messages like this:

 2007-02-19 21:58:54.328 EST - DEBUG:  BGWriter Scan All - Written = 5/5
 Unscanned = 23/54

 This shows that we wrote (5) the maximum pages we were allowed to write
 (5) while failing to scan almost half (23) of the buffers we meant to look
 at (54).  By taking a look at this output while the system is under load,
 I found I was able to do bgwriter optimization that used to take me days
 of frustrating testing in hours.  I've been waiting for a good guide to
 bgwriter tuning since 8.1 came out.  Once you have this, combined with
 knowing how many buffers were dirty at checkpoint time because the
 bgwriter didn't get to them in time (the number you want to minimize), the
 tuning guide practically writes itself.

 So my question is...what log level should all this go at?  Right now, I
 have the background writer stuff adjusting its level dynamically based on
 what happened; it logs at DEBUG2 if it hits the write limit (which should
 be a rare event once you're tuned properly), DEBUG3 for writes that
 scanned everything they were supposed to, and DEBUG4 if it scanned but
 didn't find anything to write.  The source of checkpoint information logs
 at DEBUG1, the fsync/write info at DEBUG2.

 I'd like to move some of these up.  On my system, I even have many of the
 checkpoint messages logged at INFO (the source of the checkpoint and the
 total write time line).  It's a bit chatty, but when you get some weird
 system pause issue it makes it easy to figure out if checkpoints were to
 blame.  Given how useful I feel some of these messages are to system
 tuning, and to explaining what currently appears as inexplicable pauses, I
 don't want them to be buried at DEBUG levels where people are unlikely to
 ever see them (I think some people may be concerned about turning on
 things labeled DEBUG at all).  I am aware that I am too deep into this to
 have an unbiased opinion at this point though, which is why I ask for
 feedback on how to proceed here.


My impression of this is that DBAs would typically want to run this for a 
short period of time to get their systems tuned, and then it pretty much 
becomes chatter.  Can you come up with an idea of what information DBAs need 
to know?  Would it be better to hide all of this logging behind a GUC that 
can be turned on or off (log_bgwriter_activity)?  Maybe you could just 
reinstrument it as dtrace probes so a separate stand-alone process could pull 
the information as needed? :-)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
