Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-25 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 Tom gets credit for naming the attached patch, which is my latest attempt to 
 finalize what has been called the "Automatic adjustment of 
 bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
 that's where it started.

I've applied this patch with some revisions.

 -The way I'm getting the passes number back from the freelist.c
 strategy code seems like it will eventually overflow

Yup ... I rewrote that.  I also revised the collection of backend-write
count events, which didn't seem to me to be something the freelist.c
code should have anything to do with.  It turns out that we can count
them with essentially no overhead by attaching the counter to
the existing fsync-request reporting machinery.
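
A minimal sketch of the idea (made-up names, not the actual source): a backend
that has to write out a dirty buffer itself already forwards an fsync request
to the bgwriter, so the same call site can bump a shared counter at no extra cost.

/* Illustrative sketch only -- not the committed code. */
#include <stdint.h>

typedef struct BgWriterSharedSketch
{
    /* ... the existing fsync-request queue fields would live here ... */
    uint32_t    num_backend_writes;     /* hypothetical counter name */
} BgWriterSharedSketch;

static void
forward_fsync_request(BgWriterSharedSketch *shared /* , relation, block ... */)
{
    /* existing behavior: append the request to the shared queue */

    /* added behavior: note that a backend, not the bgwriter, did this write */
    shared->num_backend_writes++;
}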

 -Heikki didn't like the way I pass information back from SyncOneBuffer
 back to the background writer.

I didn't either --- it was too complicated and not actually doing
anything useful.  I simplified it down to the two bits that were being
used.  We can always add more as needed, but since this routine isn't
even exported, I see no need to make it do more than the known callers
need it to do.

I did some marginal tweaking to the way you were doing the moving
averages --- in particular, use a float to avoid strange roundoff
behavior and force the smoothed_alloc average up when a new peak
occurs, instead of only letting it affect the behavior for one
cycle.
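
As a rough illustration of that change (a sketch with made-up variable names,
not a quote of the committed BgBufferSync code): keeping the average as a float
lets small per-cycle contributions accumulate, and a new peak resets the average
upward immediately instead of being averaged away.

#include <math.h>

static float smoothed_alloc = 0.0f;     /* moving average of allocations per cycle */
static const int smoothing_samples = 16;

static int
upcoming_alloc_estimate(int recent_alloc, double lru_multiplier)
{
    if ((float) recent_alloc > smoothed_alloc)
        smoothed_alloc = (float) recent_alloc;      /* new peak: jump up immediately */
    else
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc)
                          / smoothing_samples;      /* otherwise decay smoothly */

    /* aim to have this many clean, reusable buffers ready for the backends */
    return (int) rint(smoothed_alloc * lru_multiplier);
}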

Also, I set the default value of bgwriter_lru_multiplier to 2.0,
as 1.0 seemed to be leaving too many writes to the backends in my
testing.  That's something we can play with during beta when we'll
have more testing resources available.

I did some other cleanup in BgBufferSync too, like trying to reduce
the chattiness of the debug output, but I don't believe I made any
fundamental change in your algorithm.

Nice work --- thanks for seeing it through!

regards, tom lane



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-25 Thread Greg Smith

On Tue, 25 Sep 2007, Tom Lane wrote:


-Heikki didn't like the way I pass information back from SyncOneBuffer
back to the background writer.

I didn't either --- it was too complicated and not actually doing
anything useful.


I suspect someone (possibly me) may want to put back some of that same 
additional complication in the future, but I'm fine with it not being 
there yet.  The main thing I wanted accomplished was changing the return 
value to a bitmask of some sort, and that's there now; adding more data to that 
interface later is at least easier now.



Also, I set the default value of bgwriter_lru_multiplier to 2.0,
as 1.0 seemed to be leaving too many writes to the backends in my
testing.


The data I've collected since originally submitting the patch agrees that 
2.0 is probably a better default as well.


I should have time to take an initial stab this week at updating the 
documentation to reflect what's now been committed, and to see how this 
stacks on top of HOT running pgbench on my test system.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-18 Thread Greg Smith
It was suggested to me today that I should clarify how others should be 
able to test this patch themselves by writing a sort of "performance 
reviewer's guide"; that information has been scattered among material 
covering development.  That's what you'll find below.  Let me know if any 
of it seems confusing and I'll try to clarify.  I'll be checking my mail 
and responding intermittently while I'm away, just won't be able to run any 
tests myself until next week.


The latest version of the background writer code that I've been reporting 
on is attached to the first message in this thread:


http://archives.postgresql.org/pgsql-hackers/2007-09/msg00214.php

I haven't found any reason so far to update that code, the existing 
exposed tunables still appear sufficient for all the situations I've 
found.


Track Buffer Allocations and Cleaner Efficiency
---

First you apply the patch inside buf-alloc-2.patch.gz , which adds several 
entries to pg_stat_bgwriter; it applied cleanly to HEAD at the point when 
I generated it.  I'd suggest testing that one to collect baseline 
information with the current background writer, and to confirm that the 
overhead of tracking the buffer allocations by itself doesn't cause a 
performance hit, before applying the second patch.  I keep two clusters 
going on the same port, one with just buf-alloc-2, one with both patches, 
to be able to make such comparisons, only having one active at a time. 
You'll need to run initdb to create a database with the new stats in it 
after applying the patch.


What I've been doing to test the effectiveness of any LRU background 
writer method using this patch is take a before/after snapshot of 
pg_stat_bgwriter.  Then I compute the delta during the test run in order 
to figure what percentage of buffers were written by the background writer 
vs. the client backends; that's the number I'm reporting as cleaner_pct in 
my tests.  Here is an example of how to compute that from the cumulative 
totals in pg_stat_bgwriter:


select round(buffers_clean * 100.0 / (buffers_backend + buffers_clean), 2)
  as cleaner_pct from pg_stat_bgwriter;


You should also monitor maxwritten_clean to make sure you've set 
bgwriter_lru_maxpages high enough that it's not limiting writes.  You can 
always turn the background writer off by setting maxpages to 0 (it's the 
only way to do so after applying the below patch).


For reference, the exact code I'm using to save the deltas and compute 
everything is available within pgbench-tools-0.2 at 
http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm


The code inside the benchwarmer script uses a table called test_bgwriter 
(schema in init/resultdb.sql), populates it before the test, then computes 
the delta afterwards.  bufsummary.sql generates the results I've been 
putting in my messages.  I assume there's a cleaner way to compute just 
these numbers by resetting the statistics before the test instead, but 
that didn't fit into what I was working towards.
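
For anyone who doesn't want to dig through those scripts, here is a sketch of
the same snapshot-and-delta idea in plain SQL; the snapshot table and its
columns here are just an example, not the actual test_bgwriter schema:

-- Hypothetical names; the real test_bgwriter schema lives in init/resultdb.sql.
create table bgwriter_before as
  select now() as snapped_at, * from pg_stat_bgwriter;

-- ... run the pgbench test here ...

select round(100.0 * (a.buffers_clean - b.buffers_clean)
             / nullif((a.buffers_clean - b.buffers_clean)
                      + (a.buffers_backend - b.buffers_backend), 0), 2)
       as cleaner_pct
from pg_stat_bgwriter a, bgwriter_before b;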


New Background Writer Logic
---

The second patch in jit-cleaner.patch.gz applies on top of buf-alloc-2. 
It modifies the LRU background writer with the just-in-time logic as I 
described in the message the patches were attached to.  The main tunable 
there is bgwriter_lru_multiplier, which replaces bgwriter_lru_percent. 
The effective range seems to be 1.0 to 3.0.  You can take an existing 8.3 
postgresql.conf, rename bgwriter_lru_percent to bgwriter_lru_multiplier, 
adjust the value to be in the right range, and then it will work with this 
patched version.


For comparing the patched vs. original BGW behavior, I've taken to keeping 
definitions for both variables in a common postgresql.conf, and then I 
just comment/uncomment the one I need based on which version I'm running:


bgwriter_lru_multiplier = 1.0
#bgwriter_lru_percent = 5

The main thing I've noticed so far is that as you decrease bgwriter_delay 
from the default of 200ms, the multiplier has needed to be larger to 
maintain the same cleaner percentage in my tests.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-17 Thread Greg Smith

On Sat, 8 Sep 2007, Greg Smith wrote:

Here's the results I got when I pushed the time down significantly from the 
defaults

                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+------------
 jit multiplier=1.0 scan_whole=120s delay=20ms |  20 |  956 |       92.34
 jit multiplier=2.0 scan_whole=120s delay=20ms |  21 |  967 |       99.94

 jit multiplier=1.5 scan_whole=120s delay=10ms |  22 |  944 |       97.91
 jit multiplier=2.0 scan_whole=120s delay=10ms |  23 |  981 |        99.7
It seems I have to push the multiplier higher to get good results when using 
a much lower interval


Since I'm not exactly overwhelmed processing field reports, I've continued 
this line of investigation myself...increasing the multiplier to 3.0 got 
me another nine on the buffers written by the LRU BGW without a 
significant change in performance:


                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+------------
 jit multiplier=3.0 scan_whole=120s delay=10ms |  24 |  967 |       99.95

After thinking for a bit about why the 10ms case wasn't working so well 
without a big multiplier, I considered that the default moving average 
smoothing makes the sample period cover such a short stretch of time 
(10ms * 16 = 160ms) that it's unlikely to span a typical pause that 
one might want to smooth over.  My initial thinking was to increase the 
period of the smoothing so that it's of similar length to the default case 
even when the interval goes down, but that didn't really improve anything 
(note that the 16 case here is the default setup with just the delay at 
10ms, which was a missing piece of data from the above as well--I only 
tested with larger multipliers above at 10ms):


                     info                     | set | tps  | cleaner_pct
----------------------------------------------+-----+------+------------
 jit multiplier=1.0 delay=10ms smoothing=16   |  27 |  982 |        89.4
 jit multiplier=1.0 delay=10ms smoothing=64   |  26 |  946 |       89.55
 jit multiplier=1.0 delay=10ms smoothing=320  |  25 |  970 |       89.53

What I realized is that after rounding the number of buffers to an 
integer, dividing a very short period of activity by the smoothing 
constant was resulting in the smoothing value usually dropping to 0 and 
not doing much.  This made me wonder how much the weighted average 
smoothing was really doing in the default case.  I put that code in months 
ago and I hadn't looked recently at its effectiveness.  Here's a 
comparison:


                     info                     | set | tps  | cleaner_pct
----------------------------------------------+-----+------+------------
 jit multiplier=1.0 delay=200ms smoothing=16  |  18 |  970 |       99.99
 jit multiplier=1.0 delay=200ms smoothing=off |  28 |  957 |       97.16
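
To make the rounding failure at short delays concrete, here's a toy example
(the per-cycle allocation count is made up) showing how the integer version of
the adjustment truncates to zero while a float average still converges:

#include <stdio.h>

int
main(void)
{
    int   recent_alloc = 5;        /* allocations seen in one 10ms cycle (made-up) */
    int   smoothing_samples = 16;

    int   int_avg = 0;             /* integer smoothing: each adjustment truncates to 0 */
    float float_avg = 0.0f;        /* float smoothing: small contributions accumulate */

    for (int i = 0; i < 32; i++)
    {
        int_avg += (recent_alloc - int_avg) / smoothing_samples;
        float_avg += ((float) recent_alloc - float_avg) / smoothing_samples;
    }

    /* prints "integer average: 0, float average: 4.4" (approximately) */
    printf("integer average: %d, float average: %.1f\n", int_avg, float_avg);
    return 0;
}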

All this data supports my suggestion that the exact value of the smoothing 
period constant isn't really a critical one.  It appears moderately 
helpful to have that logic on in some cases and the default value doesn't 
seem to hurt the cases where I'd expect it to be the least effective. 
Tuning the multiplier is much more powerful and useful than ever touching 
this constant.  I could probably even pull the smoothing logic out 
altogether, at the cost of increasing the burden of correctly tuning the 
multiplier on the administrator.  So far it looks like it's reasonable 
instead to leave it as an untunable to help the default configuration, and 
I'll just add a documentation note that if you decrease the interval 
you'll probably have to increase the multiplier.


After going through this, the extra data gives more useful baselines to do 
a similar sensitivity analysis of the other item that's untunable in the 
current patch:


float   scan_whole_pool_seconds = 120.0;
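
For context, what that constant controls can be sketched like this (my own
illustration of the mechanism, not the patch source): it sets a floor on how
many buffers each bgwriter round must examine so the strategy point gets lapped
within roughly that many seconds, no matter how little allocation is going on.

static int
min_scan_per_round(int shared_buffers, int bgwriter_delay_ms,
                   float scan_whole_pool_seconds)
{
    float rounds_per_lap = (scan_whole_pool_seconds * 1000.0f) / bgwriter_delay_ms;

    return (int) (shared_buffers / rounds_per_lap) + 1;
}

/* Example: 20000 buffers, delay=200ms, 120s target -> 600 rounds -> ~34 buffers/round.
 * At delay=10ms the same target needs only ~2 buffers/round, which is why this
 * constant barely does anything at the faster delays tested above. */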

But I'll be travelling for the next week and won't have time to look into 
that myself until I get back.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Greg Smith

On Fri, 7 Sep 2007, Simon Riggs wrote:


For me, the bgwriter should sleep for at most 10ms at a time.


Here's the results I got when I pushed the time down significantly from 
the defaults, with some of the earlier results for comparison:


                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+------------
 jit multiplier=2.0 scan_whole=120s delay=200ms|  17 |  981 |       99.98
 jit multiplier=1.0 scan_whole=120s delay=200ms|  18 |  970 |       99.99

 jit multiplier=1.0 scan_whole=120s delay=20ms |  20 |  956 |       92.34
 jit multiplier=2.0 scan_whole=120s delay=20ms |  21 |  967 |       99.94

 jit multiplier=1.5 scan_whole=120s delay=10ms |  22 |  944 |       97.91
 jit multiplier=2.0 scan_whole=120s delay=10ms |  23 |  981 |        99.7

It seems I have to push the multiplier higher to get good results when 
using a much lower interval, which was expected, but the fundamentals all 
scale down to running much faster the way I'd hoped.


I'm tempted to make the default 10ms, adjust some of the other constants 
just a bit to optimize better for that time scale:  make the default 
multiplier 2.0, increase the weighted average sample period, and perhaps 
reduce scan_whole a bit because that's barely doing anything at 10ms.  If 
no one discovers any problems with working that way during beta, then 
consider locking them in for the RC.  That would leave just the multiplier 
and maxpages as the exposed tunables, and it's very easy to tune maxpages 
just by watching pg_stat_bgwriter.  This would obviously be a very 
aggressive plan--it would be eliminating GUCs and reducing flexibility for 
people in the field, aiming instead at making this more automatic for the 
average case.


If anyone has a reason why they feel the bgwriter_delay needs to be a 
tunable or why the rate might need to run even faster than 10ms, now would 
be a good time to say why.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 If anyone has a reason why they feel the bgwriter_delay needs to be a 
 tunable or why the rate might need to run even faster than 10ms, now would 
 be a good time to say why.

You'd be hard-wiring the thing to wake up 100 times per second?  Doesn't
sound like a good plan from here.  Keep in mind that not everyone wants
their machine to be dedicated to Postgres, and some people even would
like their CPU to go to sleep now and again.

I've already gotten flak about the current default of 200ms:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
I can't imagine that folk with those types of goals will tolerate
an un-tunable 10ms cycle.

In fact, given the numbers you show here, I'd say you should leave the
default cycle time at 200ms.  The 10ms value is eating way more CPU and
producing absolutely no measured benefit relative to 200ms...

regards, tom lane



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Greg Smith

On Sat, 8 Sep 2007, Tom Lane wrote:

I've already gotten flak about the current default of 200ms: 
https://bugzilla.redhat.com/show_bug.cgi?id=252129
I can't imagine that folk with those types of goals will tolerate an 
un-tunable 10ms cycle.


That's the counter-example I was looking for as to why lowering the default 
is unacceptable.  Scratch bgwriter_delay off the list of things that might 
be fixed to a specific value.


Will return to the drawing board to figure out a way to incorporate what 
I've learned about running at 10ms into a tuning plan that still works 
fine at 200ms or higher.  The good news as far as I'm concerned is that I 
haven't had to adjust the code so far, just tweak the existing knobs.


In fact, given the numbers you show here, I'd say you should leave the 
default cycle time at 200ms.  The 10ms value is eating way more CPU and 
producing absolutely no measured benefit relative to 200ms...


My server is a bit underpowered to run at 10ms and gain anything when 
doing a stress test like this; I was content that it didn't degrade 
performance significantly, that was the best I could hope for.  I would 
expect the class of systems that Simon and Heikki are working with could 
show significant benefit from running the BGW that often.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 On Sat, 8 Sep 2007, Tom Lane wrote:
 In fact, given the numbers you show here, I'd say you should leave the 
 default cycle time at 200ms.  The 10ms value is eating way more CPU and 
 producing absolutely no measured benefit relative to 200ms...

 My server is a bit underpowered to run at 10ms and gain anything when 
 doing a stress test like this; I was content that it didn't degrade 
 performance significantly, that was the best I could hope for.  I would 
 expect the class of systems that Simon and Heikki are working with could 
 show significant benefit from running the BGW that often.

Quite possibly.  So it sounds like we still need to expose
bgwriter_delay as a tunable.

It might be interesting to consider making the delay auto-tune: if you
wake up and find nothing (much) to do, sleep longer the next time,
conversely shorten the delay when work picks up.  Something for 8.4,
though, at this point.

regards, tom lane



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Gregory Stark
Greg Smith [EMAIL PROTECTED] writes:

 On Sat, 8 Sep 2007, Tom Lane wrote:

 I've already gotten flak about the current default of 200ms:
 https://bugzilla.redhat.com/show_bug.cgi?id=252129
 I can't imagine that folk with those types of goals will tolerate an
 un-tunable 10ms cycle.

 That's the counter-example I was looking for as to why lowering the default is
 unacceptable.  Scratch bgwriter_delay off the list of things that might be fixed
 to a specific value.

Ok, time for the obligatory contrarian voice here. It's all well and good to
aim to eliminate GUC variables but I don't think it's productive to do so by
simply hard-wiring them. 

Firstly that doesn't really make life any easier than simply finding good
defaults and documenting that DBAs probably shouldn't be bothering to tweak
them.

Secondly it's unlikely to work. The variables under consideration may have
reasonable defaults but they're not likely to have defaults that will work in every
case. This example is pretty typical. There aren't many variables that will
have a reasonable default which will work for both an interactive desktop
where Postgres is running in the background and Sun's 1000+ process
benchmarks.

What I think is more likely to work is looking for ways to make these
variables auto-tuning. That eliminates the knob not by just hiding it away and
declaring it doesn't exist but by architecting the system so that there really
is no knob that might need tweaking.

Perhaps what would work better here is having a semaphore which bgwriter
sleeps on which backends wake up whenever the clock sweep hand completes a
cycle. Or gets within a certain fraction of a cycle of catching up.

Or perhaps bgwriter shouldn't be adjusting the number of pages it processes at
all and instead it should only be adjusting the sleep time. So it would always
process a full cycle for example but adjust the sleep time based on what
percentage of the cycle the backends used up in the last sleep time.
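
A rough sketch of that second idea, just to make it concrete (hypothetical
names, nothing like a finished patch): keep the work per round fixed and adjust
only the sleep, based on how much of a full clock-sweep cycle the backends
consumed during the last sleep.

static int
next_sleep_ms(int prev_sleep_ms, double cycle_fraction_used,
              int min_sleep_ms, int max_sleep_ms)
{
    const double target_fraction = 0.5;   /* wake when about half a cycle is used up */
    double       next;

    if (cycle_fraction_used < 0.01)
        cycle_fraction_used = 0.01;       /* nearly idle: stretch toward the cap */

    next = prev_sleep_ms * (target_fraction / cycle_fraction_used);
    if (next < min_sleep_ms)
        next = min_sleep_ms;
    if (next > max_sleep_ms)
        next = max_sleep_ms;
    return (int) next;
}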

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Greg Smith

On Sat, 8 Sep 2007, Tom Lane wrote:


It might be interesting to consider making the delay auto-tune: if you
wake up and find nothing (much) to do, sleep longer the next time,
conversely shorten the delay when work picks up.  Something for 8.4,
though, at this point.


I have a couple of pages of notes on how to tune the delay automatically. 
The tricky part is applications that go from 0 to full speed with little 
warning; the first few seconds of the stock market open come to mind. 
What I was working toward was considering what you set the delay to as a 
steady-state value, and then the delay cranks downward as activity levels 
go up.  As activity dies off, it slowly returns to the default again.
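
The shape I had in mind looks something like this sketch (placeholder names
and constants, not an implementation): crank the delay down quickly under load
and let it drift back toward the configured steady state when things go quiet.

static int
adjust_delay_ms(int current_delay_ms, int steady_state_delay_ms,
                int min_delay_ms, int recent_alloc)
{
    if (recent_alloc > 0)
    {
        /* activity: back off quickly toward the floor */
        current_delay_ms /= 2;
        if (current_delay_ms < min_delay_ms)
            current_delay_ms = min_delay_ms;
    }
    else
    {
        /* idle: drift slowly back up toward the configured steady state */
        int step = (steady_state_delay_ms - current_delay_ms) / 8;

        current_delay_ms += (step > 0) ? step : 1;
        if (current_delay_ms > steady_state_delay_ms)
            current_delay_ms = steady_state_delay_ms;
    }
    return current_delay_ms;
}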


But I realized that I needed to get all this other stuff working, all the 
statistics counters exposed usefully, and then collect a lot more data 
before I could implement that plan.  Definitely something that might fit 
into 8.4, completely impossible for 8.3.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Greg Smith

On Thu, 6 Sep 2007, Decibel! wrote:


I don't know that there should be a direct correlation, but ISTM that
scan_whole_pool_seconds should take checkpoint intervals into account
somehow.


Any direct correlation is weak at this point.  The LRU cleaner has a small 
impact on checkpoints, in that it's writing out buffers that may make the 
checkpoint quicker.  But this particular write trickling mechanism is not 
aimed directly at flushing the whole pool; it's more about smoothing out 
idle periods a bit.


Also, computing the checkpoint interval is itself tricky.  Heikki had to 
put some work into getting something that took into account both the 
timeout and segments mechanisms to gauge progress, and I'm not sure I can 
directly re-use that because it's really only doing that while the 
checkpoint is active.  I'm not saying it's a bad idea to have the expected 
interval as an input to the model, just that it's not obvious to me how to 
do it and whether it would really help.



I like the idea of not having that as a GUC, but I'm doubtful that it
can be hard-coded like that. What if checkpoint_timeout is set to 120?
Or 60? Or 2000?


Someone using 60 or 120 has checkpoint problems way bigger than the LRU 
cleaner can be expected to help with.  How fast the reusable buffers it 
can write are pushed out is the least of their problems.  Also, I'd expect 
that the only cases using such a low value for a good reason are doing so 
because they have enormous amounts of activity on their system, and in 
that case the primary JIT mechanism should dominate how the LRU cleaner 
treats them.  scan_whole_pool_seconds doesn't do anything if the primary 
mechanism was already planning to scan more buffers than it aims for.


Someone who has very infrequent checkpoints and therefore low activity, 
like your 2000 case, can expect that the LRU cleaner will lap and catch up 
to the strategy point about 2 minutes after any activity and then follow 
directly behind it with the way I've set this up.  If that's cleaning the 
buffer cache too aggressively, I think those in that situation would be 
better served by constraining the maxpages parameter; that's directly 
adjusting what I'd expect their real issue is, how fast pages can flush to 
disk, rather than the secondary one of how fast the pool is being scanned.


I picked 2 minutes for that value because it's as slow as I can make it 
and still serve its purpose, while not feeling to me like it's too fast 
for a relatively idle system even if someone set maxpages=1000.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-08 Thread Alvaro Herrera
Greg Smith wrote:
 On Sat, 8 Sep 2007, Tom Lane wrote:

 It might be interesting to consider making the delay auto-tune: if you
 wake up and find nothing (much) to do, sleep longer the next time,
 conversely shorten the delay when work picks up.  Something for 8.4,
 though, at this point.

 I have a couple of pages of notes on how to tune the delay automatically. 
 The tricky part are applications that go from 0 to full speed with little 
 warning; the first few seconds of the stock market open come to mind.

Maybe have the backends send a signal to bgwriter when they see it
sleeping and are overwhelmed by work.  That way, bgwriter can sleep for
a few seconds, safe in the knowledge that somebody else will wake it up
if needed sooner.  The way backends would detect that bgwriter is
sleeping is that bgwriter would keep an atomic flag in shared memory,
and it gets set only if it's going to sleep for long (so if it's going
to sleep for (say) 100ms or less, it doesn't set the flag, so the
backends won't signal it).  In order to avoid a huge amount of signals
when all backends suddenly start working at the same instant, have the
signal itself be sent only by the first backend that manages to
LWLockConditionalAcquire a lwlock that's only used for that purpose.
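
In simplified pseudo-C (placeholder names, not the real shared-memory or
signaling primitives), the backend side would look something like:

#include <stdbool.h>

typedef struct
{
    volatile bool bgwriter_napping;   /* set by bgwriter only before long sleeps */
    volatile int  wakener_lock;       /* stand-in for a conditionally acquired lwlock */
} BgWriterFlags;

/* Called by a backend that finds itself writing dirty buffers. */
static void
maybe_wake_bgwriter(BgWriterFlags *shmem)
{
    if (shmem->bgwriter_napping &&
        __sync_lock_test_and_set(&shmem->wakener_lock, 1) == 0)
    {
        /* only the first backend to get here sends the wakeup, so a sudden
         * burst of activity produces one signal rather than a storm of them */
        /* send_wakeup_signal();  -- placeholder for the real mechanism */
        __sync_lock_release(&shmem->wakener_lock);
    }
}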

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Simon Riggs
On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote:
 On Fri, 7 Sep 2007, Simon Riggs wrote:
 
  I think that is what we should be measuring, perhaps in a simple way 
  such as calculating the 90th percentile of the response time 
  distribution.
 
 I do track the 90th percentile numbers, but in these pgbench tests where 
 I'm writing as fast as possible they're actually useless--in many cases 
 they're *smaller* than the average response, because there are enough 
 cases where there is a really, really long wait that they skew the average 
 up really hard.  Take a look at any of the individual test graphs and 
 you'll see what I mean.

I've looked at the graphs now, but I'm not any wiser, I'm very sorry to
say. We need something like a frequency distribution curve, not just the
actual times. Bottom line is we need a good way to visualise the
detailed effects of the patch.

I think we should do some more basic tests to see where those outliers
come from. We need to establish a clear link between number of dirty
writes and response time. If there is one, which we all believe, then it
is worth minimising those with these techniques. We might just be
chasing the wrong thing.

Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.

For me, we can enter Beta while this is still partially in the air. We
won't be able to get this right without lots of other feedback. So I
think we should concentrate now on making sure we've got the logging in
place so we can check whether your patch works when its out there. I'd
say lets include what you've done and then see how it works during Beta.
We've been trying to get this right for years now, so we have to allow
some slack to make sure we get this right. We can reduce or strip out
logging once we go RC.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Simon Riggs
On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote:

 Tom gets credit for naming the attached patch, which is my latest attempt to 
 finalize what has been called the "Automatic adjustment of 
 bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
 that's where it started.

This is a big undertaking, so well done for going for it.

 I decided to use pgbench for running my tests.  The scripting framework to 
 collect all that data and usefully summarize it is now available as 
 pgbench-tools-0.2 at 
 http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

For me, the main role of the bgwriter is to avoid dirty writes in
backends. The purpose of doing that is to improve the response time
distribution as perceived by users. I think that is what we should be
measuring, perhaps in a simple way such as calculating the 90th
percentile of the response time distribution. Looking only at headline
numbers, especially tps, it is notoriously difficult to extract any
meaning from test results.

Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.

If I run a server at or below 70% capacity, what settings of the
bgwriter help maintain my response time distribution?

 Coping with idle periods
 
 
 While I was basically happy with these results, the data Kevin Grittner 
 submitted in response to my last call for commentary left me concerned. While 
 the JIT approach works fine as long as your system is active, it does 
 absolutely nothing if the system is idle.  I noticed that a lot of the writes 
 that were being done by the client backends were after idle periods where the 
 JIT writer just didn't react fast enough during the ramp-up.  For example, if 
 the system went from idle for a while to full-speed just as the 200ms sleep 
 started, by the time the BGW woke up again the backends could have needed to 
 write many buffers already themselves.

You've hit the nail on the head there. I can't see how you can do
anything sensible when the bgwriter keeps going to sleep for long
periods.

The bgwriter's activity curve should ideally be the same shape as a
critically damped harmonic oscillator. It should wake up, lots of
writing if needed, then trail off over time. The only way to do that
seems to be to vary the sleep automatically, or make short sleeps.

For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.

If you've changed bgwriter so it doesn't scan if no blocks have been
allocated, I don't see any reason to keep the _delay parameter at all.

 I think I can safely say there is a level of intelligence going into what the 
 LRU background writer does with this patch that has never been applied to this 
 problem before.  There have been a lot of good ideas thrown out in this area, 
 but it took a hybrid approach that included and carefully balanced all of them 
 to actually get results that I felt were usable. What I don't know is whether 
 that will also be true for other testers.

I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.

There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At the very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning. At the very
least we need this to be present during Beta, if not beyond.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com




Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Greg Smith

On Fri, 7 Sep 2007, Simon Riggs wrote:

I think that is what we should be measuring, perhaps in a simple way 
such as calculating the 90th percentile of the response time 
distribution.


I do track the 90th percentile numbers, but in these pgbench tests where 
I'm writing as fast as possible they're actually useless--in many cases 
they're *smaller* than the average response, because there are enough 
cases where there is a really, really long wait that they skew the average 
up really hard.  Take a look at any of the individual test graphs and 
you'll see what I mean.



Looking at the tps also tempts us to run a test which maxes out the
server, an area we already know and expect the bgwriter to be unhelpful
in.


I tried to turn that around and make my thinking be that if I built a 
bgwriter that did most of the writes without badly impacting the measure 
we know and expect it to be unhelpful in, that would be more likely to 
yield a robust design.  It kept me out of areas where I might have built 
something that had to be disclaimed with "don't run this when the server 
is maxed out".



For me, the bgwriter should sleep for at most 10ms at a time. If it has
nothing to do it can go straight back to sleep again. Trying to set that
time is fairly difficult, so it would be better not to have to set it at
all.


I wanted to get this patch out there so people could start thinking about 
what I'd done and consider whether this still fit into the 8.3 timeline. 
What I'm doing myself right now is running tests with a much lower setting 
for the delay time--am testing 20ms right now.  I personally would be 
happy saying it's 10ms and that's it.  Is anyone using a time lower than 
that right now?  I seem to recall that 10ms was also the shortest interval 
Heikki used in his tests.



I get the feeling that what we have here is better than what we had
before, but I guess I'm a bit disappointed we still have 3 magic
parameters, or 5 if you count your hard-coded ones also.


I may be able to eliminate more of them, but I didn't want to take them 
out before beta.  If it can be demonstrated that some of these parameters 
can be set to specific values and still work across a wider range of 
applications than what I've tested, then there's certainly room to fix 
some of these, which actually makes some things easier.  For example, I'd 
be more confident fixing the weighted average smoothing period to a 
specific number if I knew the delay was fixed, and there's two parameters 
gone.  And the multiplier is begging to be eliminated, just need some more 
data to confirm that's true.



There's still no formal way to tune these. As long as we have *any*
magic parameters, we need a way to tune them in the field, or they are
useless. At very least we need a plan for how people will report results
during Beta. That means we need a log_bgwriter (better name, please...)
parameter that provides information to assist with tuning.


Once I got past the "does it work?" stage, I've been doing all the tuning 
work using a before/after snapshot of pg_stat_bgwriter data during a 
representative snapshot of activity and looking at the delta.  Been a 
while since I actually looked into the logs for anything.  It's very 
straightforward to put together a formal tuning plan using the data in 
there, particularly compared to the the impossibility of creating such a 
plan in the current code.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-07 Thread Greg Smith

On Fri, 7 Sep 2007, Simon Riggs wrote:

I think we should do some more basic tests to see where those outliers 
come from. We need to establish a clear link between number of dirty 
writes and response time.


With the test I'm running, which is specifically designed to aggravate 
this behavior, the outliers on my system come from how Linux buffers 
writes.  I can adjust them a bit by playing with the parameters as 
described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but 
on the hardware I've got here (single 7200RPM disk for database, another 
for WAL) they don't move much.  Once /proc/meminfo shows enough Dirty 
memory that pdflush starts blocking writes, game over; you're looking at 
multi-second delays before my plain old IDE disks clear enough debris out 
to start responding to new requests even with the Areca controller I'm 
using.



Perhaps output the number of dirty blocks written on the same line as
the output of log_min_duration_statement so that we can correlate
response time to dirty-block-writes on that statement.


On Linux at least, I'd expect this won't reveal much.  There, the 
interesting correlation is with how much dirty data is in the underlying 
OS buffer cache.  And exactly how that plays into things is a bit strange 
sometimes.  If you go back to Heikki's DBT2 tests with the background 
writer schemes he tested, he got frustrated enough with that disconnect 
that he wrote a little test program just to map out the underlying 
weirdness: 
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php


I've confirmed his results on my system and done some improvements to that 
program myself, but pushed further work on it to the side to finish up the 
main background writer task instead.  I may circle back to that.  I'd 
really like to run all this on another OS as well (I have Solaris 10 on my 
server box but not fully setup yet), but I can only volunteer so much time 
to work on all this right now.


If there's anything that needs to be looked at more carefully during tests 
in this area, it's getting more data about just what the underlying OS is 
doing while all this is going on.  Just the output from vmstat/iostat is 
very informative.  Those using DBT2 for their tests get some nice graphs 
of this already.  I've done some pgbench-based tests before that included that 
data and were very enlightening, but sadly that system isn't available 
to me anymore.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Greg Smith

On Thu, 6 Sep 2007, Kevin Grittner wrote:


If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
allay all of my concerns about this patch.  Basically, our problems were
resolved by getting all dirty buffers out to the OS cache within two
seconds


Unfortunately it wouldn't make my concerns about your system go away or 
I'd have recommended exposing it specifically to address your situation. 
I have been staring carefully at your configuration recently, and I would 
wager that you could turn off the LRU writer altogether and still meet 
your requirements in 8.2.  Here's what you've got right now:



shared_buffers = 160MB (=20000 buffers)
bgwriter_lru_percent = 20.0
bgwriter_lru_maxpages = 200
bgwriter_all_percent = 10.0
bgwriter_all_maxpages = 600


With the default delay of 200ms, this has the LRU-writer scanning the 
whole pool every 1 second, while the all-writer scans every two 
seconds--assuming they don't hit the write limits.  If some event were to 
dirty the whole pool in 200ms, it might take as much as 6.7 seconds to 
 write everything out (20000 / 600 * 200 ms) via the all-scan.  The 
all-scan is already gone in 8.3.  Your LRU scan will take much longer than 
 that to clear everything out.  At least (20000 / 200 * 200ms) 20 seconds 
to clear a fully dirty cache.
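
Stated as a formula, that worst case is just the arithmetic above:

/* Illustrative: worst-case seconds to push out a fully dirty pool when the
 * writer is limited to maxpages buffers per delay_ms round. */
static double
worst_case_flush_seconds(int shared_buffers, int maxpages, int delay_ms)
{
    return ((double) shared_buffers / maxpages) * delay_ms / 1000.0;
}
/* 20000 buffers at 600 pages per 200ms round -> ~6.7s; at 200 pages -> 20s */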


But in fact, it's impossible to even bound how long it will take before 
the LRU writer (which is the only part this new patch tries to improve) 
gets around to writing even a single dirty buffer no matter what 
bgwriter_lru_percent (8.2) or scan_whole_pool_seconds (JIT patch) is set 
to.


There's a second low-level issue involved here.  When a page becomes 
dirty, that implies it was also recently used, which means the LRU writer 
won't touch it.  That page can't be written out by the LRU writer until an 
entire pass has been made over the shared_buffer pool while looking for 
buffers to allocate for new activity.  When the allocation clock-sweep 
passes over the newly dirtied buffer again, its usage count will drop by 
one and it will no longer be considered recently used.  At that point the 
LRU writer can write it out.  So unless there is other allocation activity 
going on, the scan_whole_pool_seconds mechanism will never provide the 
bound on time to scan and write everything you hope it will.
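
A compact sketch of that interaction (illustrative names only, not the
bufmgr.c source): dirtying a buffer also marks it recently used, the LRU writer
only considers buffers whose usage count has fallen back to zero, and the count
only falls when the allocation clock sweep passes over the buffer.

#include <stdbool.h>

typedef struct
{
    bool dirty;
    int  usage_count;                 /* PostgreSQL caps this at 5 */
} BufferSketch;

static void
touch_and_dirty(BufferSketch *buf)
{
    buf->dirty = true;
    if (buf->usage_count < 5)
        buf->usage_count++;           /* "recently used": the LRU writer skips it */
}

/* One visit by the allocation clock sweep while hunting for a free buffer. */
static bool
sweep_visit(BufferSketch *buf)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;           /* not reusable yet; revisit on the next lap */
        return false;
    }
    return true;                      /* count is 0: reusable, and if dirty the
                                       * LRU writer may now write it out */
}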


And if there's other allocations going on, the much more powerful JIT 
mechanism will scan the whole pool plenty fast if you bump the already 
exposed multiplier tunable up.  In my tests where the buffer cache was 
filled with mostly dirty buffers that couldn't be re-used (something 
relatively easy to trigger with pgbench tests), I've actually watched the 
new code scan 90% of the buffer cache looking for those few reusable 
buffers in the pool in a single invocation.  This would be like setting 
bgwriter_lru_percent=90.0 in the old configuration, but it only gets that 
aggressive when the distribution of pages in the buffer cache demands it, 
and when it has reason to believe going that fast will be helpful.


The completely understandable line of thinking that led to your request 
here is one of my concerns with exposing scan_whole_pool_seconds as a 
tunable.  It may suggest to people that if they set the number very low, 
it will assure all dirty buffers will be scanned and written within that 
time bound.  That's certainly not the case; both the maxpages and the 
usage count information will actually drive the speed that mechanism plods 
through the buffer cache.  It really isn't useful for scanning fast.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Kevin Grittner
 On Wed, Sep 5, 2007 at 10:31 PM, in message
[EMAIL PROTECTED], Greg Smith
[EMAIL PROTECTED] wrote: 
 
 -There are two magic constants in the code:
 
  int smoothing_samples = 16;
  float   scan_whole_pool_seconds = 120.0;
 

 I personally 
 don't feel like these constants need to be exposed for tuning purposes;

 Determining 
 whether these should be exposed as GUC tunables is certainly an open 
 question though.
 
If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
allay all of my concerns about this patch.  Basically, our problems were
resolved by getting all dirty buffers out to the OS cache within two
seconds; any longer than that and the OS cache didn't reach its trigger
point for pushing out to the controller cache in time to prevent the glut
which locks everything up.  I also suspect that this interval kept the OS
cache more aware of frequently updated pages, so that it could avoid
unnecessary physical writes under its own logic.
 
While I'm hoping that the new checkpoint techniques will be a better
solution, I can't count on that without significant testing in our
environment, and I really want a fall-back.  The metric you emphasized was
the percentage of PostgreSQL writes to the OS cache which were handled by
the background writer, which doesn't necessarily correspond to a solution
to the glut, which is based on the peak number of total writes presented
to the controller by the OS within a small window of time.
 
-Kevin
 





Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Kevin Grittner
 On Thu, Sep 6, 2007 at 11:27 AM, in message
[EMAIL PROTECTED], Greg Smith
[EMAIL PROTECTED] wrote: 
 On Thu, 6 Sep 2007, Kevin Grittner wrote:
 
 I have been staring carefully at your configuration recently, and I would 
 wager that you could turn off the LRU writer altogether and still meet 
 your requirements in 8.2.
 
I totally agree that it is of minor benefit compared to the all-writer,
if it even matters at all.  I knew that when I chose the settings.
 
 Here's what you've got right now:
 
 shared_buffers = 160MB (=20000 buffers)
 bgwriter_lru_percent = 20.0
 bgwriter_lru_maxpages = 200
 bgwriter_all_percent = 10.0
 bgwriter_all_maxpages = 600
 
 With the default delay of 200ms, this has the LRU-writer scanning the 
 whole pool every 1 second,
 
Whoa!  Apparently I've totally misread the documentation.  I thought that
the bgwriter_lru_percent was scanned from the lru end each time; I would
not expect that it would ever get beyond the oldest 10%.  I put that in
just as a guard to keep the backends from having to wait for the OS write.
I've always doubted whether it was helping, but it wasn't broke...
 
 while the all-writer scans every two 
 seconds--assuming they don't hit the write limits.  If some event were to 
 dirty the whole pool in 200ms, it might take as much as 6.7 seconds to 
 write everything out (20000 / 600 * 200 ms) via the all-scan.
 
Right.  Since the file system didn't seem to be able to accept writes
faster than 800 PostgreSQL pages per second, and I wanted to leave a
LITTLE slack, I set that limit.  We don't seem to hit it, as far as I can
tell.  In fact, the output rate would be naturally fairly smooth, if not
for the "hold all dirty pages until the last possible moment, then write 
them all to the OS and fsync" approach.
 
 There's a second low-level issue involved here.  When a page becomes 
 dirty, that implies it was also recently used, which means the LRU writer 
 won't touch it.  That page can't be written out by the LRU writer until an 
 entire pass has been made over the shared_buffer pool while looking for 
 buffers to allocate for new activity.  When the allocation clock-sweep 
 passes over the newly dirtied buffer again, its usage count will drop by 
 one and it will no longer be considered recently used.  At that point the 
 LRU writer can write it out.
 
How low does the count have to go, or does it track the count when it
becomes dirty and look for a decrease?
 
 So unless there is other allocation activity 
 going on, the scan_whole_pool_seconds mechanism will never provide the 
 bound on time to scan and write everything you hope it will.
 
That may not be an issue for the environment where this has been a problem
for us -- the web hits are coming in at a pretty good rate 24/7.  (We have
a couple dozen large companies scanning data through HTTP SOAP requests
all the time.)  This should keep us reading new pages, which covers this,
yes?
 
 where the buffer cache was 
 filled with mostly dirty buffers that couldn't be re-used
 
That would be the condition that would be the killer with a synchronous
checkpoint if the OS cache has already had some dirty pages trickled out.
If we can hit this condition in our web database, either the load
distributed checkpoint will save us, or we can't use 8.3.  Period.
 
 The completely understandable line of thinking that led to your request 
 here is one of my concerns with exposing scan_whole_pool_seconds as a 
 tunable.  It may suggest to people that if they set the number very low, 
 it will assure all dirty buffers will be scanned and written within that 
 time bound.  That's certainly not the case; both the maxpages and the 
 usage count information will actually drive the speed that mechanism plods 
 through the buffer cache.  It really isn't useful for scanning fast.
 
I'm not clear on the benefit of not writing the recently accessed dirty
pages when there are no less recently used dirty pages.  I do trust the OS
to not write them before they age out in that cache, and the OS cache
doesn't start writing dirty pages from its cache until they reach a
certain percentage of the cache space, so I'd just as soon let the OS know
that the MRU dirty pages are there, so it knows that it's time to start
working on the LRU pages in its cache.
 
-Kevin
 




Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Tom Lane
Kevin Grittner [EMAIL PROTECTED] writes:
 On Thu, Sep 6, 2007 at 11:27 AM, in message
 [EMAIL PROTECTED], Greg Smith
 [EMAIL PROTECTED] wrote: 
 With the default delay of 200ms, this has the LRU-writer scanning the 
 whole pool every 1 second,
  
 Whoa!  Apparently I've totally misread the documentation.  I thought that
 the bgwriter_lru_percent was scanned from the lru end each time; I would
 not expect that it would ever get beyond the oldest 10%.

I believe you're correct and Greg got this wrong.  I won't draw any
conclusions about whether the LRU stuff is actually doing you any good
though.

regards, tom lane



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Greg Smith

On Thu, 6 Sep 2007, Kevin Grittner wrote:

I thought that the bgwriter_lru_percent was scanned from the lru end 
each time; I would not expect that it would ever get beyond the oldest 
10%.


You're correct; I stated that badly.  What I should have said is that your 
LRU writer could potentially scan the pool as fast as once per second if 
there were enough allocations going on.



How low does the count have to go, or does it track the count when it
becomes dirty and look for a decrease?


The usage count has to be 0 before a page can be re-used for a new 
allocation, and the LRU background writer only writes out potentially 
reusable pages that are dirty.  So the count has to be 0 before it will 
write it.



This should keep us reading new pages, which covers this, yes?


One would hope.  Your whole arrangement of shared_buffers, 
checkpoint_segments, and related parameters will need to be reconsidered 
for 8.3; you've got a delicately balanced arrangement for your 8.2 setup 
right now that's working for you, but just translating it straight to 8.3 
won't get you what you want.  I'll get back to the message you already 
sent on that subject when I get enough time to address it fully.



I'm not clear on the benefit of not writing the recently accessed dirty
pages when there are no less recently used dirty pages.


This presumes PostgreSQL has some notion of the balance of recently 
accessed vs. not accessed dirty pages, which it does not.  Buffers get 
updated individually, and there's no mechanism summarizing what's in 
there; you have to scan the buffer cache yourself to figure that out.  I 
do some of that in this new patch, tracking things like how many buffers 
are scanned on average to find reusable ones.


Many months ago, I wrote a very complicated re-implementation of the 
all-scan portion of the background writer that tracked the usage count of 
everything it looked at, kept statistics about how many pages were dirty 
at each usage count, then targeted how high of a usage count could be 
written given some information about what I/O rate you felt your devices 
could sustain.  This did exactly what you're asking for here:  wrote 
whatever dirty pages were around starting with the ones that hadn't been 
recently used, then worked its way up to pages with a higher usage count 
if the recently used ones were all clean.


As far as I've been able to tell, and from Heikki's test results, the load 
distributed checkpoint was a better answer to this problem.  Rather than 
constantly fight to get pages with high usage counts out all the time, 
just spread the checkpoint out instead and deal with them only then.  I 
gave up on that branch of code while he removed the all-scan writer 
altogether as part of committing LDC.  I suspect the path I was following 
was exactly what you think you'd like to have, but it seems that it's not 
actually needed.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Just-in-time Background Writer Patch+Test Results

2007-09-06 Thread Decibel!
On Thu, Sep 06, 2007 at 09:20:31AM -0500, Kevin Grittner wrote:
  On Wed, Sep 5, 2007 at 10:31 PM, in message
 [EMAIL PROTECTED], Greg Smith
 [EMAIL PROTECTED] wrote: 
  
  -There are two magic constants in the code:
  
   int smoothing_samples = 16;
   float   scan_whole_pool_seconds = 120.0;
  
 
  I personally 
  don't feel like these constants need to be exposed for tuning purposes;
 
  Determining 
  whether these should be exposed as GUC tunables is certainly an open 
  question though.
  
 If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
 allay all of my concerns about this patch.  Basically, our problems were

I like the idea of not having that as a GUC, but I'm doubtful that it
can be hard-coded like that. What if checkpoint_timeout is set to 120?
Or 60? Or 2000?

I don't know that there should be a direct correlation, but ISTM that
scan_whole_pool_seconds should take checkpoint intervals into account
somehow.
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)

