Re: [HACKERS] RC2 and open issues

2004-12-27 Thread Bruce Momjian
Greg Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  Suppose that you run a checkpoint every 5 minutes, and with the knob
  you slow down the checkpoint to extend over say 3 minutes on average,
  rather than the normal blast-it-out-as-fast-as-possible.  Then you'll
  be keeping an average of 8 minutes worth of WAL files instead of 5.
  Not exactly a killer objection.
 
 Right. I was thinking that the goal would be to spread the checkpoint out over
 exactly the checkpoint interval, minus some safety factor. So if it has some
 estimate of the total number of dirty buffers that need flushing it could just
 divide the checkpoint interval by that and calculate the delay needed to
 finish in some fraction of the checkpoint interval, 60% seems like a
 reasonable guess.
 
  One issue is that while we can regulate the rate at which we issue
  write()s, we still have to issue fsync()s at the end, and we can't
  control what happens in response to those.  It's quite possible that
  all the I/O would happen in response to the fsync()s anyway, in which
  case the whole exercise would be a waste of time.
 
 Well you could fsync earlier as well, say just before whenever you sleep.
 Obviously the delay on the checkpoint process doesn't matter to performance if
 it's about to sleep. It could end up scheduling i/o earlier than necessary and
 cause redundant seeks but then I guess that's an inherent tension between
 trying to spread out the i/o evenly and trying to get the ideal ordering of
 i/o.

It certainly is an interesting idea to have the checkpoint span a longer
time period.  We couldn't do that with sync(), but now that we fsync()
each file it is possible.

It would be easy to do this if we didn't also need the fsync.  The original
idea was that we would write() the dirty buffers long before the
checkpoint, and the kernel would write many of these dirty buffers
before we got to checkpoint time.

We could go with the checkpoint clock sweep idea, but then we aren't
just writing the buffers; we are actually doing write/fsync a lot more.
I can't think of a way this would be a win.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] RC2 and open issues

2004-12-24 Thread Kenneth Marshall
On Mon, Dec 20, 2004 at 11:20:46PM -0500, Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  Tom Lane wrote:
  Exactly.  But 1% would be uselessly small with this definition.  Offhand
  I'd think something like 50% might be a starting point; maybe even more.
  What that says is that a page isn't a candidate to be written out by the
  bgwriter until it's fallen halfway down the LRU list.
 
  So we are not scanning by buffer address but using the LRU list?  Are we
  sure they are mostly dirty?
 
 No.  The entire point is to keep the LRU end of the list mostly clean.
 
 Now that you mention it, it might be interesting to try the approach of
 doing a clock scan on the buffer array and ignoring the ARC lists
 entirely.  That would be a fundamentally different way of envisioning
 what the bgwriter is supposed to do, though.  I think the main reason
 Jan didn't try that was he wanted to be sure the LRU page was usually
 clean so that backends would seldom end up doing writes for themselves
 when they needed to get a free buffer.
 
 Maybe we need a hybrid approach: clean a few percent of the LRU end of
 the ARC list in order to keep backends from blocking on writes, plus run
 a clock scan to keep checkpoints from having to do much.  But that's way
 beyond what we have time for in the 8.0 cycle.
 
   regards, tom lane
 

I have not had a chance to investigate, but there is a modification of
the ARC cache strategy called CAR that replaces the LRU linked lists
with the clock approximation to the LRU lists. This algorithm is virtually
identical to the current ARC but reduces the contention at the MRU end
of the lists. This may dovetail nicely into your idea of a clock bgwriter
functionality as well as help with the cache-line performance problem.
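A minimal sketch of the clock idea Ken mentions (a generic second-chance clock, not the actual CAR algorithm; all names are invented for illustration): a cache hit merely sets a reference bit instead of relinking a shared LRU list, which is where the contention savings at the MRU end come from.

```python
class ClockBuffer:
    """Toy clock that approximates an LRU list with reference bits,
    in the spirit of CAR's replacement of ARC's linked lists.
    Linear searches keep the toy short; a real cache uses a hash table."""

    def __init__(self, nslots):
        self.slots = [None] * nslots   # cached page ids
        self.ref = [False] * nslots    # reference ("second chance") bits
        self.hand = 0

    def access(self, page):
        if page in self.slots:
            # Hit: just set the reference bit -- no list manipulation,
            # hence no contention on a shared MRU pointer.
            self.ref[self.slots.index(page)] = True
            return
        # Miss: sweep the hand past recently referenced slots,
        # clearing their bits, until an unreferenced victim is found.
        while self.slots[self.hand] is not None and self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.slots)
        self.slots[self.hand] = page
        self.ref[self.hand] = False
        self.hand = (self.hand + 1) % len(self.slots)
```

A page that is touched again before the hand comes around survives the sweep, approximating "recently used" without any list moves.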

Yours,
Ken Marshall



Re: [HACKERS] RC2 and open issues

2004-12-24 Thread Bruce Momjian
Greg Stark wrote:
 
 Tom Lane [EMAIL PROTECTED] writes:
 
  Maybe we need a hybrid approach: clean a few percent of the LRU end of
  the ARC list in order to keep backends from blocking on writes, plus run
  a clock scan to keep checkpoints from having to do much.  
 
 Well if you just keep note of when the last clock scan started then when you
 get to the end of the list you've _done_ a checkpoint.
 
 Put another way, we already have such a clock scan, it's called checkpoint.
 You could have checkpoint delay between each page write long enough to spread
 the checkpoint i/o out over a configurable amount of time -- say half the
 checkpoint interval -- and be done with that side of things.

But don't you have to keep the WAL files around longer then?



Re: [HACKERS] RC2 and open issues

2004-12-24 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Greg Stark wrote:
 Put another way, we already have such a clock scan, it's called checkpoint.
 You could have checkpoint delay between each page write long enough to spread
 the checkpoint i/o out over a configurable amount of time -- say half the
 checkpoint interval -- and be done with that side of things.

 But don't you have to keep the WAL files around longer then?

Yeah, but do you care?  It seems like what Greg is suggesting is a
checkpoint slowdown knob comparable to the vacuum slowdown
feature that Jan added for 8.0.  It strikes me as not necessarily
a bad idea.

Suppose that you run a checkpoint every 5 minutes, and with the knob
you slow down the checkpoint to extend over say 3 minutes on average,
rather than the normal blast-it-out-as-fast-as-possible.  Then you'll
be keeping an average of 8 minutes worth of WAL files instead of 5.
Not exactly a killer objection.
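The arithmetic, spelled out (a sketch; it assumes WAL must be retained from the start of the previous checkpoint until the current one finishes its stretched write phase):

```python
def avg_wal_window_min(checkpoint_interval, avg_spread):
    # WAL covering the previous checkpoint cycle can only be recycled
    # once the current checkpoint completes, so stretching the write
    # phase by avg_spread minutes extends the retained window by the
    # same amount on average.
    return checkpoint_interval + avg_spread

# A 5-minute interval with the write phase spread over 3 minutes
# keeps about 8 minutes worth of WAL:
window = avg_wal_window_min(5, 3)
```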

Shutdown checkpoints would still need to go as fast as possible,
so we might need two separate code paths; or maybe we could just
change the delay setting locally during a shutdown.

One issue is that while we can regulate the rate at which we issue
write()s, we still have to issue fsync()s at the end, and we can't
control what happens in response to those.  It's quite possible that
all the I/O would happen in response to the fsync()s anyway, in which
case the whole exercise would be a waste of time.

regards, tom lane



Re: [HACKERS] RC2 and open issues

2004-12-24 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Suppose that you run a checkpoint every 5 minutes, and with the knob
 you slow down the checkpoint to extend over say 3 minutes on average,
 rather than the normal blast-it-out-as-fast-as-possible.  Then you'll
 be keeping an average of 8 minutes worth of WAL files instead of 5.
 Not exactly a killer objection.

Right. I was thinking that the goal would be to spread the checkpoint out over
exactly the checkpoint interval, minus some safety factor. So if it has some
estimate of the total number of dirty buffers that need flushing it could just
divide the checkpoint interval by that and calculate the delay needed to
finish in some fraction of the checkpoint interval, 60% seems like a
reasonable guess.
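A toy model of that calculation (illustrative only; the function and parameter names are invented, and this is not PostgreSQL code):

```python
def checkpoint_write_delay(interval_s, est_dirty_buffers, spread_fraction=0.6):
    """Sleep to insert between buffer writes so the write phase of a
    checkpoint finishes within spread_fraction of the checkpoint
    interval, given an estimate of the dirty buffers to flush."""
    if est_dirty_buffers <= 0:
        return 0.0
    budget_s = interval_s * spread_fraction   # e.g. 60% of the interval
    return budget_s / est_dirty_buffers

# A 5-minute interval with ~30000 dirty buffers yields a 6 ms pause
# between writes:
delay = checkpoint_write_delay(300, 30000)
```

The estimate would have to be taken at checkpoint start; if it is low, the checkpoint simply overshoots the target fraction, which is why some safety factor below 100% is wanted.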

 One issue is that while we can regulate the rate at which we issue
 write()s, we still have to issue fsync()s at the end, and we can't
 control what happens in response to those.  It's quite possible that
 all the I/O would happen in response to the fsync()s anyway, in which
 case the whole exercise would be a waste of time.

Well you could fsync earlier as well, say just before whenever you sleep.
Obviously the delay on the checkpoint process doesn't matter to performance if
it's about to sleep. It could end up scheduling i/o earlier than necessary and
cause redundant seeks but then I guess that's an inherent tension between
trying to spread out the i/o evenly and trying to get the ideal ordering of
i/o.

-- 
greg




Re: [HACKERS] RC2 and open issues

2004-12-22 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Maybe we need a hybrid approach: clean a few percent of the LRU end of
 the ARC list in order to keep backends from blocking on writes, plus run
 a clock scan to keep checkpoints from having to do much.  

Well if you just keep note of when the last clock scan started then when you
get to the end of the list you've _done_ a checkpoint.

Put another way, we already have such a clock scan, it's called checkpoint.
You could have checkpoint delay between each page write long enough to spread
the checkpoint i/o out over a configurable amount of time -- say half the
checkpoint interval -- and be done with that side of things.



Re: [HACKERS] RC2 and open issues

2004-12-22 Thread Simon Riggs
On Tue, 2004-12-21 at 15:26, Tom Lane wrote:
 Richard Huxton dev@archonet.com writes:
  However, one thing you can say is that if block B hasn't been written to 
  since you last checked, then any blocks older than that haven't been 
  written to either.
 
 [ itch... ]  Can you?  I don't recall exactly when a block gets pushed
 up the ARC list during a ReadBuffer/WriteBuffer cycle, but at the very
 least I'd have to say that this assumption is vulnerable to race
 conditions.
 

An intriguing idea: after some thought this would only be true if all
block accesses were writes. A block can be re-read (but not written),
causing it to move to the MRU of T2, thus moving it ahead of other dirty
buffers.

Forgive me: the conveyor belt analogy only applies when blocks on the
buffer list haven't been touched *at all*. i.e. if they are hit only
once (on T1) or twice (T2) they then just move down towards the LRU and
roll off when they get there.

-- 
Best Regards, Simon Riggs




Re: Re: [HACKERS] RC2 and open issues

2004-12-21 Thread simon

Tom Lane [EMAIL PROTECTED] wrote on 21.12.2004, 07:32:52:
 Gavin Sherry  writes:
  I was also thinking of benchmarking the effect of changing the algorithm

"Changing the algorithm" is a phrase that sends shivers up my spine. My
own preference is towards some change, but as minimal as possible.

  in StrategyDirtyBufferList(): currently, for each iteration of the loop we
  read a buffer from each of T1 and T2. I was wondering what effect reading
  T1 first then T2 and vice versa would have on performance.
 
 Looking at StrategyGetBuffer, it definitely seems like a good idea to
 try to keep the bottom end of both T1 and T2 lists clean.  But we should
 work at T1 a bit harder.
 
 The insight I take away from today's discussion is that there are two
 separate goals here: try to keep backends that acquire a buffer via
 StrategyGetBuffer from being fed a dirty buffer they have to write,
 and try to keep the next upcoming checkpoint from having too much work
 to do.  Those are both laudable goals but I hadn't really seen before
 that they may require different strategies to achieve.  I'm liking the
 idea that bgwriter should alternate between doing writes in pursuit of
 the one goal and doing writes in pursuit of the other.

Agreed: there are two different goals for buffer list management.

I like the way the current algorithm searches both T1 and T2 in
parallel, since that works no matter how long each list is. Always
cleaning one list in preference to the other would not work well since
ARC fluctuates. At any point in time, cleaning one list will have more
benefit than cleaning the other, but which one is best switches
backwards and forwards as ARC fluctuates. 

Perhaps the best way would be to concentrate on the list that, at this
point in time, is the one that needs to be cleanest. I *think* that
means we should concentrate on the LRU of the *longest* list, since
that is the direction in which ARC is trying to move (I agree that
seems counter-intuitive, but a few pairs of eyes should confirm which
way round it is).

By observation, DBT2 ends up with T2 > T1, but that is a result of its
fairly static nature. i.e. DBT2 would benefit from T2 LRU cleaning.

ISTM it would be good to have:
1) very frequent, but small cleaning action on the lists, say every 50ms
to avoid backends having to write a buffer
2) less frequent, deeper cleaning actions, to minimise the effect of
checkpoints, which could be done every 10th cycle e.g. 500ms
(numbers would vary according to workload...)

But, like I said: change, but minimal change seems best to me for now.
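The two cadences above could be sketched like this (function names and structure invented; a model of the scheduling idea, not actual bgwriter code):

```python
import itertools

def bgwriter_cycles(clean_lru_tail, deep_clean, deep_every=10):
    """Sketch of the two-frequency scheme: a small LRU-tail clean on
    every cycle, a deeper clean every deep_every-th cycle.  The caller
    is assumed to sleep ~50ms between iterations, giving roughly the
    50ms / 500ms cadences suggested above."""
    for n in itertools.count(1):
        clean_lru_tail()          # frequent, small: keep backends from writing
        if n % deep_every == 0:
            deep_clean()          # infrequent, deep: shrink checkpoint work
        yield n
```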

Best Regards, Simon Riggs



Re: Re: [HACKERS] RC2 and open issues

2004-12-21 Thread simon

Tom Lane [EMAIL PROTECTED] wrote on 21.12.2004, 05:05:36:
 Bruce Momjian  writes:
  I am confused.  If we change the percentage to be X% of the entire
  buffer cache, and we set it to 1%, and we exit when either the dirty
  pages or % are reached, don't we end up just scanning the first 1% of
  the cache over and over again?
 
 Exactly.  But 1% would be uselessly small with this definition.  Offhand
 I'd think something like 50% might be a starting point; maybe even more.
 What that says is that a page isn't a candidate to be written out by the
 bgwriter until it's fallen halfway down the LRU list.
 

I see the buffer list as a conveyor belt that carries unneeded blocks
away from the MRU. Cleaning near the LRU (I agree: How near?) should be
all that is sufficient to keep the list clean.

Cleaning the first 1% over and over again makes it sound like it is
the same list of blocks that are being cleaned. It may be the same
linked list data structure, but that is dynamically changing to contain
completely different blocks from the last time you looked.

Best Regards, Simon Riggs



Re: [HACKERS] RC2 and open issues

2004-12-21 Thread Zeugswetter Andreas DAZ SD

 If we don't start where we left off, I am thinking if you do a lot of
 writes then do nothing, the next checkpoint would be huge because a lot
 of the LRU will be dirty because the bgwriter never got to it.

I think the problem is that we don't see whether a read-hot
page is also write-hot. We would want to write dirty read-hot pages,
but not write-hot pages. It does not make sense to write a write-hot
page since it will be dirty again when the checkpoint comes.

Andreas



Re: [HACKERS] RC2 and open issues

2004-12-21 Thread Richard Huxton
[EMAIL PROTECTED] wrote:
Tom Lane [EMAIL PROTECTED] wrote on 21.12.2004, 05:05:36:
Bruce Momjian  writes:
I am confused.  If we change the percentage to be X% of the entire
buffer cache, and we set it to 1%, and we exit when either the dirty
pages or % are reached, don't we end up just scanning the first 1% of
the cache over and over again?
Exactly.  But 1% would be uselessly small with this definition.  Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.

I see the buffer list as a conveyor belt that carries unneeded blocks
away from the MRU. Cleaning near the LRU (I agree: How near?) should be
all that is sufficient to keep the list clean.
Cleaning the first 1% over and over again makes it sound like it is
the same list of blocks that are being cleaned. It may be the same
linked list data structure, but that is dynamically changing to contain
completely different blocks from the last time you looked.
However, one thing you can say is that if block B hasn't been written to 
 since you last checked, then any blocks older than that haven't been 
written to either. Of course, the problem is in finding block B again 
without re-scanning from the LRU end.

Is there any non-intrusive way we could add a bookmark into the 
conveyor belt? (mixing my metaphors again :-) Any blocks written to 
would move up the cache, effectively moving the bookmark lower. Enough 
activity would cause the bookmark to drop off the end. If that isn't the 
case though, we know we can safely skip any blocks older than the bookmark.
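A toy model of the bookmark idea (names invented; the race conditions raised elsewhere in this thread are deliberately ignored here):

```python
class ConveyorBookmark:
    """Sketch of a bookmark on the conveyor belt.  lru_list is ordered
    MRU -> LRU.  If the bookmarked block is still present, everything
    at or below it is unchanged since the last scan and can be skipped;
    if the bookmark has dropped off the end, rescan everything."""

    def __init__(self):
        self.mark = None

    def blocks_to_scan(self, lru_list):
        if self.mark in lru_list:
            cut = lru_list.index(self.mark)
            todo = lru_list[:cut]       # only blocks newer than the mark
        else:
            todo = list(lru_list)       # bookmark fell off: full rescan
        if lru_list:
            self.mark = lru_list[0]     # remember the current MRU block
        return todo
```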

--
  Richard Huxton
  Archonet Ltd


Re: [HACKERS] RC2 and open issues

2004-12-21 Thread Jim C. Nasby
On Tue, Dec 21, 2004 at 10:26:48AM -0500, Tom Lane wrote:
 Richard Huxton [EMAIL PROTECTED] writes:
  However, one thing you can say is that if block B hasn't been written to 
  since you last checked, then any blocks older than that haven't been 
  written to either.
 
 [ itch... ]  Can you?  I don't recall exactly when a block gets pushed
 up the ARC list during a ReadBuffer/WriteBuffer cycle, but at the very
 least I'd have to say that this assumption is vulnerable to race
 conditions.
 
 Also, the cntxDirty mechanism allows a block to be dirtied without
 changing the ARC state at all.  I am not very clear on whether Vadim
 added that mechanism just for performance or because there were
 fundamental deadlock issues without it; but in either case we'd have
 to think long and hard about taking it out for the bgwriter's benefit.

OTOH, ISTM that it's ok if the bgwriter occasionally misses blocks.
These blocks would either result in a backend or the checkpointer having
to write out a block (not so great), or the bgwriter could occasionally
ignore its bookmark and restart its scan from the LRU.

Of course I'm assuming that any race conditions could be made to impact
only the bgwriter and nothing else, which may be a bad assumption.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



[HACKERS] RC2 and open issues

2004-12-20 Thread Bruce Momjian
We are now packaging RC2.  If nothing comes up after RC2 is released, we
can move to final release.

The open items list is attached.  The doc changes can be easily
completed before final.  The only code issue left is with bgwriter.  We
always knew we needed to find better defaults for its parameters, but we
are only now finding more fundamental issues.

I think the summary I have seen recently pegs it right --- our use of %
of dirty buffers requires a scan of the entire buffer cache, and the
current delay of bgwriter is too high, but we can't lower it because the
buffer cache scan will become too expensive if done too frequently.

I think the ideal solution would be to remove bgwriter_percent or change
it to be a percentage of all buffers, not just dirty buffers, so we
don't have to scan the entire list.  If we set the new value to 10% with
a delay of 1 second, and the bgwriter remembers the place it stopped
scanning the buffer cache, you will clean out the buffer cache
completely every 10 seconds.
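That resume-where-you-stopped sweep could be modeled like this (an illustrative sketch with invented names, not the actual bgwriter):

```python
class SweepScanner:
    """Each bgwriter run scans a fixed percentage of the buffer array,
    resuming where the previous run stopped, so the whole cache is
    covered once every 100/percent runs (e.g. 10% per run at a
    1-second delay covers everything in 10 seconds)."""

    def __init__(self, nbuffers, percent):
        self.nbuffers = nbuffers
        self.chunk = max(1, nbuffers * percent // 100)
        self.pos = 0                      # remembered stop position

    def next_run(self):
        ids = [(self.pos + i) % self.nbuffers for i in range(self.chunk)]
        self.pos = (self.pos + self.chunk) % self.nbuffers
        return ids                        # buffer ids to examine this run
```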

Right now it seems no one can find proper values.  We were clear that
this was an issue but it is bad news that we are only addressing it
during RC.

The 8.1 solution is to have some feedback system so writes by individual
backends cause the bgwriter to work more frequently.

The big question is what to do during RC2?  Do we just leave it as
suboptimal, knowing we will revisit it in 8.1, or try an incremental
solution for 8.0 that might work better?

We have to decide now.

---

   PostgreSQL 8.0 Open Items
   =========================

Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems.

Changes
---
* change bgwriter buffer scan behavior?
* adjust bgwriter defaults

Documentation
-
* synchronize supported encodings and docs
* improve external interfaces documentation section
* manual pages

Fixed Since Last Beta
-



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 I think the ideal solution would be to remove bgwriter_percent or change
 it to be a percentage of all buffers, not just dirty buffers, so we
 don't have to scan the entire list.  If we set the new value to 10% with
 a delay of 1 second, and the bgwriter remembers the place it stopped
 scanning the buffer cache, you will clean out the buffer cache
 completely every 10 seconds.

But we don't *want* it to clean out the buffer cache completely.
There's no point in writing a hot page every few seconds.  So I don't
think I believe in remembering where we stopped anyway.

I think there's a reasonable case to be made for redefining
bgwriter_percent as the max percent of the total buffer list to scan
(not the max percent of the list to return --- Jan correctly pointed out
that the latter is useless).  Then we could modify
StrategyDirtyBufferList so that the percent and maxpages parameters are
passed in, so it can stop as soon as either one is satisfied.  This
would be a fairly small/safe code change and I wouldn't have a problem
doing it even at this late stage of the cycle.
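The two stopping conditions could look like this (a sketch of the proposed behavior with invented names; the real StrategyDirtyBufferList is C code inside the buffer manager):

```python
def collect_dirty(buffers, percent, maxpages):
    """Scan buffers from the LRU end, stopping as soon as either
    `percent` of the list has been examined or `maxpages` dirty
    buffers have been collected -- whichever comes first.
    `buffers` is a list of (buffer_id, is_dirty) pairs in LRU order."""
    limit = len(buffers) * percent // 100
    dirty = []
    for scanned, (buf_id, is_dirty) in enumerate(buffers, start=1):
        if is_dirty:
            dirty.append(buf_id)
        if scanned >= limit or len(dirty) >= maxpages:
            break
    return dirty
```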

However ... we would have to crank up the default bgwriter_percent,
and I don't know if we have any better idea what to set it to after
such a change than we do now ...

regards, tom lane



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I think the ideal solution would be to remove bgwriter_percent or change
  it to be a percentage of all buffers, not just dirty buffers, so we
  don't have to scan the entire list.  If we set the new value to 10% with
  a delay of 1 second, and the bgwriter remembers the place it stopped
  scanning the buffer cache, you will clean out the buffer cache
  completely every 10 seconds.
 
 But we don't *want* it to clean out the buffer cache completely.

You are only cleaning it out in pieces over a 10-second period, so it is
getting dirty again in the meantime.  You are not scanning the entire
buffer cache at one time.

 There's no point in writing a hot page every few seconds.  So I don't
 think I believe in remembering where we stopped anyway.

I was thinking if you are doing this scanning every X milliseconds then
after a while the front of the buffer cache will be mostly clean and the
end will be dirty so you will always be going over the same early ones
to get to the later dirty ones.  Remembering the location gives the scan
more uniform coverage of the buffer cache.

You need a clock sweep like BSD uses (and probably others).

 I think there's a reasonable case to be made for redefining
 bgwriter_percent as the max percent of the total buffer list to scan
 (not the max percent of the list to return --- Jan correctly pointed out
 that the latter is useless).  Then we could modify
 StrategyDirtyBufferList so that the percent and maxpages parameters are
 passed in, so it can stop as soon as either one is satisfied.  This
 would be a fairly small/safe code change and I wouldn't have a problem
 doing it even at this late stage of the cycle.
 
 Howeve ... we would have to crank up the default bgwriter_percent,
 and I don't know if we have any better idea what to set it to after
 such a change than we do now ...

Once we make the change we will have to get our testers working on it. 
We need those figures to change over time based on backends doing writes,
but that isn't going to happen for 8.0.



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 You need a clock sweep like BSD uses (and probably others).

No, that's *fundamentally* wrong.

The reason we are going to the trouble of maintaining a complicated
cache algorithm like ARC is so that we can tell the heavily used pages
from the lesser used ones.  To throw away that knowledge in favor of
doing I/O with a plain clock sweep algorithm is just wrong.

What's more, I don't even understand what clock sweep would mean given
that the ordering of the list is constantly changing.

regards, tom lane



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 I am confused.  If we change the percentage to be X% of the entire
 buffer cache, and we set it to 1%, and we exit when either the dirty
 pages or % are reached, don't we end up just scanning the first 1% of
 the cache over and over again?

Exactly.  But 1% would be uselessly small with this definition.  Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.

regards, tom lane



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  I am confused.  If we change the percentage to be X% of the entire
  buffer cache, and we set it to 1%, and we exit when either the dirty
  pages or % are reached, don't we end up just scanning the first 1% of
  the cache over and over again?
 
 Exactly.  But 1% would be uselessly small with this definition.  Offhand
 I'd think something like 50% might be a starting point; maybe even more.
 What that says is that a page isn't a candidate to be written out by the
 bgwriter until it's fallen halfway down the LRU list.

So we are not scanning by buffer address but using the LRU list?  Are we
sure they are mostly dirty?



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Exactly.  But 1% would be uselessly small with this definition.  Offhand
 I'd think something like 50% might be a starting point; maybe even more.
 What that says is that a page isn't a candidate to be written out by the
 bgwriter until it's fallen halfway down the LRU list.

 So we are not scanning by buffer address but using the LRU list?  Are we
 sure they are mostly dirty?

No.  The entire point is to keep the LRU end of the list mostly clean.

Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely.  That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though.  I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.

Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much.  But that's way
beyond what we have time for in the 8.0 cycle.

regards, tom lane



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Exactly.  But 1% would be uselessly small with this definition.  Offhand
  I'd think something like 50% might be a starting point; maybe even more.
  What that says is that a page isn't a candidate to be written out by the
  bgwriter until it's fallen halfway down the LRU list.
 
  So we are not scanning by buffer address but using the LRU list?  Are we
  sure they are mostly dirty?
 
 No.  The entire point is to keep the LRU end of the list mostly clean.
 
 Now that you mention it, it might be interesting to try the approach of
 doing a clock scan on the buffer array and ignoring the ARC lists
 entirely.  That would be a fundamentally different way of envisioning
 what the bgwriter is supposed to do, though.  I think the main reason
 Jan didn't try that was he wanted to be sure the LRU page was usually
 clean so that backends would seldom end up doing writes for themselves
 when they needed to get a free buffer.
 
 Maybe we need a hybrid approach: clean a few percent of the LRU end of
 the ARC list in order to keep backends from blocking on writes, plus run
 a clock scan to keep checkpoints from having to do much.  But that's way
 beyond what we have time for in the 8.0 cycle.

OK, so we scan from the end of the LRU.  If we scan X% and find _no_
dirty buffers perhaps we should start where we left off last time.

If we don't start where we left off, I am thinking if you do a lot of
writes then do nothing, the next checkpoint would be huge because a lot
of the LRU will be dirty because the bgwriter never got to it.



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Gavin Sherry
On Mon, 20 Dec 2004, Tom Lane wrote:

 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Exactly.  But 1% would be uselessly small with this definition.  Offhand
  I'd think something like 50% might be a starting point; maybe even more.
  What that says is that a page isn't a candidate to be written out by the
  bgwriter until it's fallen halfway down the LRU list.

  So we are not scanning by buffer address but using the LRU list?  Are we
  sure they are mostly dirty?

 No.  The entire point is to keep the LRU end of the list mostly clean.

 Now that you mention it, it might be interesting to try the approach of
 doing a clock scan on the buffer array and ignoring the ARC lists
 entirely.  That would be a fundamentally different way of envisioning
 what the bgwriter is supposed to do, though.  I think the main reason
 Jan didn't try that was he wanted to be sure the LRU page was usually
 clean so that backends would seldom end up doing writes for themselves
 when they needed to get a free buffer.

Neil and I spoke with Jan briefly last week and he mentioned a few
different approaches he'd been tossing around. Firstly, on alternate
runs, start X% in from the LRU end, so that we aren't scanning clean
buffers all the time. Secondly, follow something like the approach
you've mentioned above but remember the offset. So, if we're scanning
10% per run, after 10 runs we will have written out all buffers.

I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
all of T1 first then T2, and vice versa, would have on performance. I
haven't thought about this too hard, though, so it might be wrong-headed.
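The one-from-each-list interleave being compared can be sketched as follows (hypothetical helper, not the actual StrategyDirtyBufferList() code; T1 and T2 stand in for the two ARC lists):

```c
/* Collect dirty-buffer candidates by taking one entry from each of
 * T1 and T2 per loop iteration, the interleaving described above.
 * The benchmarking question is how this ordering compares with
 * draining all of T1 before T2, or vice versa. */
int collect_interleaved(const int t1[], int n1, const int t2[], int n2,
                        int out[], int max_out)
{
    int count = 0, i1 = 0, i2 = 0;

    while (count < max_out && (i1 < n1 || i2 < n2))
    {
        if (i1 < n1 && count < max_out)
            out[count++] = t1[i1++];    /* one candidate from T1 */
        if (i2 < n2 && count < max_out)
            out[count++] = t2[i2++];    /* one candidate from T2 */
    }
    return count;
}
```

The effect on performance would come from where the writes land, since T1 holds recently referenced pages and T2 frequently referenced ones, so the interleave spreads cleaning across both populations.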



 Maybe we need a hybrid approach: clean a few percent of the LRU end of
 the ARC list in order to keep backends from blocking on writes, plus run
 a clock scan to keep checkpoints from having to do much.  But that's way
 beyond what we have time for in the 8.0 cycle.

Definitely.


regards, tom lane


Thanks,

Gavin



Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Bruce Momjian
Gavin Sherry wrote:
 Neil and I spoke with Jan briefly last week and he mentioned a few
 different approaches he'd been tossing around. Firstly, on alternate
 runs, start X% in from the LRU end, so that we aren't scanning clean
 buffers all the time. Secondly, follow something like the approach
 you've mentioned above but remember the offset. So, if we're scanning
 10% per run, after 10 runs we will have written out all buffers.
 
 I was also thinking of benchmarking the effect of changing the algorithm
 in StrategyDirtyBufferList(): currently, for each iteration of the loop we
 read a buffer from each of T1 and T2. I was wondering what effect reading
 all of T1 first then T2, and vice versa, would have on performance. I
 haven't thought about this too hard, though, so it might be wrong-headed.

So we are all thinking in the same direction.  We might have only a few
days to finalize this before final release.




Re: [HACKERS] RC2 and open issues

2004-12-20 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 I was also thinking of benchmarking the effect of changing the algorithm
 in StrategyDirtyBufferList(): currently, for each iteration of the loop we
 read a buffer from each of T1 and T2. I was wondering what effect reading
 T1 first then T2 and vice versa would have on performance.

Looking at StrategyGetBuffer, it definitely seems like a good idea to
try to keep the bottom end of both T1 and T2 lists clean.  But we should
work at T1 a bit harder.

The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write,
and try to keep the next upcoming checkpoint from having too much work
to do.  Those are both laudable goals but I hadn't really seen before
that they may require different strategies to achieve.  I'm liking the
idea that bgwriter should alternate between doing writes in pursuit of
the one goal and doing writes in pursuit of the other.
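The alternation between the two goals can be modeled in miniature (hypothetical names and a toy buffer pool, not PostgreSQL source; cleaning "writes" a buffer by clearing its dirty flag):

```c
#include <assert.h>
#include <stdbool.h>

#define NBUF 8

/* Clock hand for the checkpoint-reduction sweep, kept across rounds. */
static int clock_hand = 0;

/* Goal 1: clean the tail of the LRU list so StrategyGetBuffer rarely
 * hands a backend a dirty buffer it must write itself.  lru[] lists
 * buffer ids from most- to least-recently used. */
int clean_lru_tail(const int lru[NBUF], bool dirty[NBUF], int ntail)
{
    int written = 0;

    for (int i = NBUF - ntail; i < NBUF; i++)
        if (dirty[lru[i]]) { dirty[lru[i]] = false; written++; }
    return written;
}

/* Goal 2: advance a clock sweep over the whole pool so the next
 * checkpoint finds less left to flush. */
int clock_sweep(bool dirty[NBUF], int nscan)
{
    int written = 0;

    for (int i = 0; i < nscan; i++)
    {
        int buf = (clock_hand + i) % NBUF;

        if (dirty[buf]) { dirty[buf] = false; written++; }
    }
    clock_hand = (clock_hand + nscan) % NBUF;
    return written;
}

/* Alternate between the two goals on successive bgwriter rounds. */
int bgwriter_round(int round, const int lru[NBUF], bool dirty[NBUF])
{
    return (round % 2 == 0) ? clean_lru_tail(lru, dirty, 2)
                            : clock_sweep(dirty, 4);
}
```

The point of the split is that the two goals want different scan orders: the LRU tail is where backends look for victims, while the clock sweep spreads checkpoint work across the entire pool regardless of recency.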

regards, tom lane
