Re: [HACKERS] RC2 and open issues
Greg Stark wrote:
> Tom Lane <[EMAIL PROTECTED]> writes:
> > Suppose that you run a checkpoint every 5 minutes, and with the knob
> > you slow down the checkpoint to extend over say 3 minutes on average,
> > rather than the normal blast-it-out-as-fast-as-possible. Then you'll
> > be keeping an average of 8 minutes worth of WAL files instead of 5.
> > Not exactly a killer objection.
>
> Right. I was thinking that the goal would be to spread the checkpoint
> out over exactly the checkpoint interval, minus some safety factor. So
> if it has some estimate of the total number of dirty buffers that need
> flushing, it could just divide the checkpoint interval by that and
> calculate the delay needed to finish in some fraction of the checkpoint
> interval; 60% seems like a reasonable guess.
>
> > One issue is that while we can regulate the rate at which we issue
> > write()s, we still have to issue fsync()s at the end, and we can't
> > control what happens in response to those. It's quite possible that
> > all the I/O would happen in response to the fsync()s anyway, in which
> > case the whole exercise would be a waste of time.
>
> Well, you could fsync earlier as well, say just before whenever you
> sleep. Obviously the delay on the checkpoint process doesn't matter to
> performance if it's about to sleep. It could end up scheduling I/O
> earlier than necessary and cause redundant seeks, but then I guess
> that's an inherent tension between trying to spread out the I/O evenly
> and trying to get the ideal ordering of I/O.

It certainly is an interesting idea to have the checkpoint span a longer
time period. We couldn't do that with sync(), but now that we fsync each
file it is possible.

It would be easy to do this if we didn't also need the fsync. The original
idea was that we would write() the dirty buffers long before the
checkpoint, and the kernel would write many of these dirty buffers before
we got to checkpoint time. We could go with the checkpoint clock sweep
idea, but then we aren't just write()ing the buffers --- we are actually
doing write/fsync a lot more.

I can't think of a way this would be a win.

--
Bruce Momjian  |  http://candle.pha.pa.us
pgman@candle.pha.pa.us  |  (610) 359-1001
+  If your life is a hard drive,  |  13 Roberts Road
+  Christ can be your backup.     |  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings
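The scheme Bruce is weighing --- spread the write()s out with a delay between them, then fsync each touched file once at the end --- can be sketched as below. This is a minimal illustration, not PostgreSQL code; `spread_and_sync`, the buffer dicts, and the `do_write`/`do_fsync` callbacks are all hypothetical stand-ins for the real buffer-manager I/O.

```python
import time

def spread_and_sync(dirty, delay_s, do_write, do_fsync):
    """Sketch of a spread-out checkpoint: issue the write()s with a
    delay between them, then fsync each touched file exactly once at
    the end (possible now that checkpoints fsync per file rather than
    calling sync())."""
    touched = []
    for buf in dirty:
        do_write(buf)                      # hand the page to the kernel
        if buf["file"] not in touched:
            touched.append(buf["file"])    # remember files needing fsync
        time.sleep(delay_s)                # spread the I/O over time
    for f in touched:
        do_fsync(f)                        # one fsync per touched file
    return touched
```

Greg's variant --- fsyncing just before each sleep instead of all at the end --- would simply move the `do_fsync` call inside the loop, trading better spreading of the flush work for possibly redundant seeks.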
Re: [HACKERS] RC2 and open issues
On Mon, Dec 20, 2004 at 11:20:46PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> > > Exactly. But 1% would be uselessly small with this definition.
> > > Offhand I'd think something like 50% might be a starting point;
> > > maybe even more. What that says is that a page isn't a candidate to
> > > be written out by the bgwriter until it's fallen halfway down the
> > > LRU list.
> >
> > So we are not scanning by buffer address but using the LRU list? Are
> > we sure they are mostly dirty?
>
> No. The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely. That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though. I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean, so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.
>
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus
> run a clock scan to keep checkpoints from having to do much. But that's
> way beyond what we have time for in the 8.0 cycle.
>
> 			regards, tom lane

I have not had a chance to investigate, but there is a modification of the
ARC cache strategy called CAR that replaces the LRU linked lists with a
clock approximation of the LRU lists. This algorithm is virtually
identical to the current ARC but reduces the contention at the MRU end of
the lists. This may dovetail nicely with your idea of clock-based bgwriter
functionality, as well as help with the cache-line performance problem.

Yours,
Ken Marshall
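The structural change Ken describes --- replacing a strict LRU list with a clock approximation so that a cache hit only sets a reference bit instead of relinking a list under a lock --- can be illustrated with a minimal second-chance clock. This is a hypothetical sketch of the general technique, not the CAR paper's full algorithm and not PostgreSQL code.

```python
class ClockList:
    """Minimal second-chance clock. A hit sets a reference bit in
    place (no list manipulation, hence less MRU contention); the
    sweeping hand clears reference bits until it finds a page whose
    bit is already clear, and evicts that one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []          # entries are [page, ref_bit]
        self.hand = 0

    def access(self, page):
        for entry in self.pages:
            if entry[0] == page:
                entry[1] = 1     # hit: just set the reference bit
                return None
        if len(self.pages) >= self.capacity:
            while True:          # sweep for a victim
                entry = self.pages[self.hand]
                if entry[1]:
                    entry[1] = 0                     # give a second chance
                    self.hand = (self.hand + 1) % len(self.pages)
                else:
                    evicted = entry[0]
                    self.pages[self.hand] = [page, 0]
                    self.hand = (self.hand + 1) % len(self.pages)
                    return evicted
        self.pages.append([page, 0])
        return None
```

A page that was touched since the last sweep survives one pass of the hand, which is the clock's cheap approximation of "recently used".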
Re: [HACKERS] RC2 and open issues
Greg Stark wrote:
> Tom Lane <[EMAIL PROTECTED]> writes:
> > Maybe we need a hybrid approach: clean a few percent of the LRU end of
> > the ARC list in order to keep backends from blocking on writes, plus
> > run a clock scan to keep checkpoints from having to do much.
>
> Well, if you just keep note of when the last clock scan started, then
> when you get to the end of the list you've _done_ a checkpoint. Put
> another way, we already have such a clock scan; it's called checkpoint.
> You could have checkpoint delay between each page write long enough to
> spread the checkpoint I/O out over a configurable amount of time -- say
> half the checkpoint interval -- and be done with that side of things.

But don't you have to keep the WAL files around longer then?

--
Bruce Momjian
Re: [HACKERS] RC2 and open issues
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Greg Stark wrote:
> > Put another way, we already have such a clock scan; it's called
> > checkpoint. You could have checkpoint delay between each page write
> > long enough to spread the checkpoint I/O out over a configurable
> > amount of time -- say half the checkpoint interval -- and be done with
> > that side of things.
>
> But don't you have to keep the WAL files around longer then?

Yeah, but do you care? It seems like what Greg is suggesting is a
checkpoint slowdown knob comparable to the vacuum slowdown feature that
Jan added for 8.0. It strikes me as not necessarily a bad idea.

Suppose that you run a checkpoint every 5 minutes, and with the knob you
slow down the checkpoint to extend over say 3 minutes on average, rather
than the normal blast-it-out-as-fast-as-possible. Then you'll be keeping
an average of 8 minutes worth of WAL files instead of 5. Not exactly a
killer objection.

Shutdown checkpoints would still need to go as fast as possible, so we
might need two separate code paths; or maybe we could just change the
delay setting locally during a shutdown.

One issue is that while we can regulate the rate at which we issue
write()s, we still have to issue fsync()s at the end, and we can't control
what happens in response to those. It's quite possible that all the I/O
would happen in response to the fsync()s anyway, in which case the whole
exercise would be a waste of time.

			regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives? http://archives.postgresql.org
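Tom's 5 + 3 ≈ 8 minute estimate generalizes: spreading a checkpoint means retaining roughly one checkpoint interval plus one spread-time worth of WAL. A small sketch of the arithmetic, where the WAL write rate is a hypothetical workload figure (the 16MB segment size matches PostgreSQL of this era):

```python
def wal_kept(checkpoint_interval_min, spread_min, wal_rate_mb_per_min,
             segment_mb=16):
    """WAL that must be retained once checkpoints are spread out:
    the checkpoint interval plus the spread time worth of WAL,
    converted to 16MB segments."""
    window_min = checkpoint_interval_min + spread_min
    mb = window_min * wal_rate_mb_per_min
    segments = -(-mb // segment_mb)   # ceiling division
    return window_min, int(segments)
```

With a 5-minute interval, a 3-minute spread, and an assumed 32 MB/min of WAL, that is an 8-minute window, or 16 segments --- larger than the unspread case, but, as Tom says, not a killer objection.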
Re: [HACKERS] RC2 and open issues
Tom Lane <[EMAIL PROTECTED]> writes:
> Suppose that you run a checkpoint every 5 minutes, and with the knob you
> slow down the checkpoint to extend over say 3 minutes on average, rather
> than the normal blast-it-out-as-fast-as-possible. Then you'll be keeping
> an average of 8 minutes worth of WAL files instead of 5. Not exactly a
> killer objection.

Right. I was thinking that the goal would be to spread the checkpoint out
over exactly the checkpoint interval, minus some safety factor. So if it
has some estimate of the total number of dirty buffers that need flushing,
it could just divide the checkpoint interval by that and calculate the
delay needed to finish in some fraction of the checkpoint interval; 60%
seems like a reasonable guess.

> One issue is that while we can regulate the rate at which we issue
> write()s, we still have to issue fsync()s at the end, and we can't
> control what happens in response to those. It's quite possible that all
> the I/O would happen in response to the fsync()s anyway, in which case
> the whole exercise would be a waste of time.

Well, you could fsync earlier as well, say just before whenever you sleep.
Obviously the delay on the checkpoint process doesn't matter to
performance if it's about to sleep. It could end up scheduling I/O earlier
than necessary and cause redundant seeks, but then I guess that's an
inherent tension between trying to spread out the I/O evenly and trying to
get the ideal ordering of I/O.

--
greg
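Greg's delay calculation is simple enough to state directly: the budget is some fraction of the checkpoint interval (his 60% guess), divided evenly across the estimated dirty-buffer count. A hypothetical sketch (the function name and signature are illustrative, not PostgreSQL's):

```python
def checkpoint_write_delay(checkpoint_interval_s, n_dirty, fraction=0.6):
    """Per-write sleep needed to finish writing all dirty buffers
    within `fraction` of the checkpoint interval."""
    if n_dirty == 0:
        return 0.0
    return (checkpoint_interval_s * fraction) / n_dirty
```

For a 5-minute (300 s) interval and 3600 dirty buffers, that works out to a 50 ms sleep between writes.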
Re: [HACKERS] RC2 and open issues
Tom Lane <[EMAIL PROTECTED]> writes:
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus run
> a clock scan to keep checkpoints from having to do much.

Well, if you just keep note of when the last clock scan started, then when
you get to the end of the list you've _done_ a checkpoint. Put another
way, we already have such a clock scan; it's called checkpoint. You could
have checkpoint delay between each page write long enough to spread the
checkpoint I/O out over a configurable amount of time -- say half the
checkpoint interval -- and be done with that side of things.

--
greg
Re: [HACKERS] RC2 and open issues
On Tue, 2004-12-21 at 15:26, Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
> > However, one thing you can say is that if block B hasn't been written
> > to since you last checked, then any blocks older than that haven't
> > been written to either.
>
> [ itch... ] Can you? I don't recall exactly when a block gets pushed up
> the ARC list during a ReadBuffer/WriteBuffer cycle, but at the very
> least I'd have to say that this assumption is vulnerable to race
> conditions.

An intriguing idea: after some thought, this would only be true if all
block accesses were writes. A block can be re-read (but not written),
causing it to move to the MRU of T2, thus moving it ahead of other dirty
buffers.

Forgive me: the conveyor belt analogy only applies when blocks on the
buffer list haven't been touched *at all*; i.e., if they are hit only once
(on T1) or twice (T2), they just move down towards the LRU and roll off
when they get there.

--
Best Regards, Simon Riggs
Re: Re: [HACKERS] RC2 and open issues
Tom Lane <[EMAIL PROTECTED]> wrote on 21.12.2004, 07:32:52:
> Gavin Sherry writes:
> > I was also thinking of benchmarking the effect of changing the
> > algorithm

"Changing the algorithm" is a phrase that sends shivers up my spine. My
own preference is towards some change, but as minimal as possible.

> > in StrategyDirtyBufferList(): currently, for each iteration of the
> > loop we read a buffer from each of T1 and T2. I was wondering what
> > effect reading T1 first then T2, and vice versa, would have on
> > performance.
>
> Looking at StrategyGetBuffer, it definitely seems like a good idea to
> try to keep the bottom end of both T1 and T2 lists clean. But we should
> work at T1 a bit harder.
>
> The insight I take away from today's discussion is that there are two
> separate goals here: try to keep backends that acquire a buffer via
> StrategyGetBuffer from being fed a dirty buffer they have to write, and
> try to keep the next upcoming checkpoint from having too much work to
> do. Those are both laudable goals, but I hadn't really seen before that
> they may require different strategies to achieve. I'm liking the idea
> that bgwriter should alternate between doing writes in pursuit of the
> one goal and doing writes in pursuit of the other.

Agreed: there are two different goals for buffer list management.

I like the way the current algorithm searches both T1 and T2 in parallel,
since that works no matter how long each list is. Always cleaning one list
in preference to the other would not work well, since ARC fluctuates. At
any point in time, cleaning one list will have more benefit than cleaning
the other, but which one is best switches back and forth as ARC
fluctuates. Perhaps the best way would be to concentrate on the list that,
at this point in time, is the one that needs to be cleanest.

I *think* that means we should concentrate on the LRU of the *longest*
list, since that is the direction in which ARC is trying to move (I agree
that seems counter-intuitive, but a few pairs of eyes should confirm which
way round it is). By observation, DBT2 ends up with T2 > T1, but that is a
result of its fairly static nature; i.e., DBT2 would benefit from T2 LRU
cleaning.

ISTM it would be good to have:
1) very frequent but small cleaning actions on the lists, say every 50ms,
   to avoid backends having to write a buffer
2) less frequent, deeper cleaning actions to minimise the effect of
   checkpoints, which could be done every 10th cycle, e.g. 500ms
(numbers would vary according to workload...)

But, like I said: change, but minimal change seems best to me for now.

Best Regards, Simon Riggs
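Simon's two-tier cadence --- a small clean every cycle and a deeper clean every tenth cycle --- is easy to sketch. The function and its return shape are hypothetical; the 50 ms and every-10th-cycle numbers are the illustrative figures from the post, which he notes would vary by workload.

```python
def bgwriter_schedule(cycle, shallow_every_ms=50, deep_every_cycles=10):
    """Decide what kind of cleaning pass this bgwriter cycle does:
    a deep pass every `deep_every_cycles` cycles (to shrink the next
    checkpoint), otherwise a shallow LRU-end pass (to keep backends
    from having to write buffers themselves)."""
    deep = (cycle % deep_every_cycles == 0)
    return {"action": "deep" if deep else "shallow",
            "next_wakeup_ms": shallow_every_ms}
```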
Re: Re: [HACKERS] RC2 and open issues
Tom Lane <[EMAIL PROTECTED]> wrote on 21.12.2004, 05:05:36:
> Bruce Momjian writes:
> > I am confused. If we change the percentage to be X% of the entire
> > buffer cache, and we set it to 1%, and we exit when either the dirty
> > pages or % are reached, don't we end up just scanning the first 1% of
> > the cache over and over again?
>
> Exactly. But 1% would be uselessly small with this definition. Offhand
> I'd think something like 50% might be a starting point; maybe even more.
> What that says is that a page isn't a candidate to be written out by the
> bgwriter until it's fallen halfway down the LRU list.

I see the buffer list as a conveyor belt that carries unneeded blocks away
from the MRU. Cleaning near the LRU (I agree: how near?) should be all
that is needed to keep the list clean.

Cleaning "the first 1% over and over again" makes it sound like it is the
same list of blocks that are being cleaned. It may be the same linked-list
data structure, but that is dynamically changing to contain completely
different blocks from the last time you looked.

Best Regards, Simon Riggs
Re: [HACKERS] RC2 and open issues
> If we don't start where we left off, I am thinking that if you do a lot
> of writes and then do nothing, the next checkpoint would be huge,
> because a lot of the LRU will be dirty because the bgwriter never got to
> it.

I think the problem is that we don't see whether a read-hot page is also
write-hot. We would want to write dirty read-hot pages, but not write-hot
pages. It does not make sense to write a write-hot page, since it will be
dirty again when the checkpoint comes.

Andreas
Re: [HACKERS] RC2 and open issues
[EMAIL PROTECTED] wrote:
> Tom Lane <[EMAIL PROTECTED]> wrote on 21.12.2004, 05:05:36:
> > Bruce Momjian writes:
> > > I am confused. If we change the percentage to be X% of the entire
> > > buffer cache, and we set it to 1%, and we exit when either the dirty
> > > pages or % are reached, don't we end up just scanning the first 1%
> > > of the cache over and over again?
> >
> > Exactly. But 1% would be uselessly small with this definition. Offhand
> > I'd think something like 50% might be a starting point; maybe even
> > more. What that says is that a page isn't a candidate to be written
> > out by the bgwriter until it's fallen halfway down the LRU list.
>
> I see the buffer list as a conveyor belt that carries unneeded blocks
> away from the MRU. Cleaning near the LRU (I agree: how near?) should be
> all that is needed to keep the list clean.
>
> Cleaning "the first 1% over and over again" makes it sound like it is
> the same list of blocks that are being cleaned. It may be the same
> linked-list data structure, but that is dynamically changing to contain
> completely different blocks from the last time you looked.

However, one thing you can say is that if block B hasn't been written to
since you last checked, then any blocks older than that haven't been
written to either. Of course, the problem is in finding block B again
without re-scanning from the LRU end. Is there any non-intrusive way we
could add a bookmark into the conveyor belt? (mixing my metaphors again
:-)

Any blocks written to would move up the cache, effectively moving the
bookmark lower. Enough activity would cause the bookmark to drop off the
end. If that isn't the case, though, we know we can safely skip any blocks
older than the bookmark.

--
Richard Huxton
Archonet Ltd
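Richard's bookmark can be sketched with a plain Python list standing in for the conveyor belt (index 0 is the LRU end). Everything here is hypothetical illustration --- and note Tom's follow-up in this thread that the underlying "older blocks are also unchanged" assumption is vulnerable to race conditions in the real ARC lists.

```python
def scan_with_bookmark(lru_list, bookmark):
    """Scan the buffer list from the LRU end, skipping blocks older
    than the bookmark (they were unchanged last time, so under the
    conveyor-belt assumption they are still clean). If the bookmark
    has dropped off the end, rescan everything. Returns the blocks to
    examine and the new bookmark (the newest block seen)."""
    if bookmark in lru_list:
        start = lru_list.index(bookmark) + 1   # skip the already-checked tail
    else:
        start = 0                              # bookmark fell off the belt
    to_examine = lru_list[start:]
    new_bookmark = lru_list[-1] if lru_list else None
    return to_examine, new_bookmark
```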
Re: [HACKERS] RC2 and open issues
On Tue, Dec 21, 2004 at 10:26:48AM -0500, Tom Lane wrote:
> Richard Huxton <[EMAIL PROTECTED]> writes:
> > However, one thing you can say is that if block B hasn't been written
> > to since you last checked, then any blocks older than that haven't
> > been written to either.
>
> [ itch... ] Can you? I don't recall exactly when a block gets pushed up
> the ARC list during a ReadBuffer/WriteBuffer cycle, but at the very
> least I'd have to say that this assumption is vulnerable to race
> conditions.
>
> Also, the cntxDirty mechanism allows a block to be dirtied without
> changing the ARC state at all. I am not very clear on whether Vadim
> added that mechanism just for performance or because there were
> fundamental deadlock issues without it; but in either case we'd have to
> think long and hard about taking it out for the bgwriter's benefit.

OTOH, ISTM that it's OK if the bgwriter occasionally misses blocks. These
blocks would either result in a backend or the checkpointer having to
write out a block (not so great), or the bgwriter could occasionally
ignore its bookmark and restart its scan from the LRU. Of course, I'm
assuming that any race conditions could be made to impact only the
bgwriter and nothing else, which may be a bad assumption.

--
Jim C. Nasby, Database Consultant
[HACKERS] RC2 and open issues
We are now packaging RC2. If nothing comes up after RC2 is released, we
can move to final release.

The open items list is attached. The doc changes can be easily completed
before final. The only code issue left is with bgwriter. We always knew we
needed to find better defaults for its parameters, but we are only now
finding more fundamental issues.

I think the summary I have seen recently pegs it right --- our use of a
percentage of dirty buffers requires a scan of the entire buffer cache,
and the current delay of bgwriter is too high, but we can't lower it
because the buffer cache scan will become too expensive if done too
frequently.

I think the ideal solution would be to remove bgwriter_percent, or change
it to be a percentage of all buffers, not just dirty buffers, so we don't
have to scan the entire list. If we set the new value to 10% with a delay
of 1 second, and the bgwriter remembers the place it stopped scanning the
buffer cache, you will clean out the buffer cache completely every 10
seconds.

Right now it seems no one can find proper values. We were clear that this
was an issue, but it is bad news that we are only addressing it during RC.
The 8.1 solution is to have some feedback system so writes by individual
backends cause the bgwriter to work more frequently. The big question is
what to do during RC2: do we just leave it as suboptimal, knowing we will
revisit it in 8.1, or try an incremental solution for 8.0 that might work
better? We have to decide now.

---------------------------------------------------------------------------

PostgreSQL 8.0 Open Items
=========================

Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems.

Changes
-------
* change bgwriter buffer scan behavior?
* adjust bgwriter defaults

Documentation
-------------
* synchronize supported encodings and docs
* improve external interfaces documentation section
* manual pages

Fixed Since Last Beta
---------------------

--
Bruce Momjian
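Bruce's "10% per second with a remembered position" proposal amounts to a partitioned sweep of the buffer array. A hypothetical sketch (function name and representation are illustrative): each round covers a fixed slice starting where the last round stopped, so the whole cache is visited every 100/pct rounds.

```python
def bgwriter_round(nbuffers, pct, start):
    """One bgwriter round: return the buffer indexes to examine this
    round (pct% of the array, starting at the remembered position)
    and the position to remember for next round."""
    n = max(nbuffers * pct // 100, 1)
    idxs = [(start + i) % nbuffers for i in range(n)]
    return idxs, (start + n) % nbuffers
```

At 10% per 1-second round, ten rounds cover every buffer --- which is exactly the behavior Tom objects to in his reply, since it ends up writing hot pages repeatedly.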
Re: [HACKERS] RC2 and open issues
Bruce Momjian <[EMAIL PROTECTED]> writes:
> I think the ideal solution would be to remove bgwriter_percent, or
> change it to be a percentage of all buffers, not just dirty buffers, so
> we don't have to scan the entire list. If we set the new value to 10%
> with a delay of 1 second, and the bgwriter remembers the place it
> stopped scanning the buffer cache, you will clean out the buffer cache
> completely every 10 seconds.

But we don't *want* it to clean out the buffer cache completely. There's
no point in writing a hot page every few seconds. So I don't think I
believe in remembering where we stopped, anyway.

I think there's a reasonable case to be made for redefining
bgwriter_percent as the max percent of the total buffer list to scan (not
the max percent of the list to return --- Jan correctly pointed out that
the latter is useless). Then we could modify StrategyDirtyBufferList so
that the percent and maxpages parameters are passed in, and it can stop as
soon as either one is satisfied. This would be a fairly small/safe code
change, and I wouldn't have a problem doing it even at this late stage of
the cycle.

However ... we would have to crank up the default bgwriter_percent, and I
don't know if we have any better idea what to set it to after such a
change than we do now ...

			regards, tom lane
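Tom's proposed change to StrategyDirtyBufferList --- two stop conditions, whichever is hit first --- is straightforward to sketch. This is a hypothetical Python stand-in for the C function, with the buffer list ordered from the LRU end:

```python
def collect_dirty(buffers, is_dirty, scan_pct, maxpages):
    """Gather dirty buffers from the LRU end: scan at most scan_pct%
    of the list, and return at most maxpages dirty pages, stopping at
    whichever limit is reached first."""
    limit = len(buffers) * scan_pct // 100
    dirty = []
    for i, buf in enumerate(buffers):
        if i >= limit or len(dirty) >= maxpages:
            break
        if is_dirty(buf):
            dirty.append(buf)
    return dirty
```

With this definition, bgwriter_percent bounds the scanning cost rather than the returned-page count, which is why Tom says 1% would be uselessly small and something like 50% is a plausible starting default.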
Re: [HACKERS] RC2 and open issues
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > I think the ideal solution would be to remove bgwriter_percent, or
> > change it to be a percentage of all buffers, not just dirty buffers,
> > so we don't have to scan the entire list. If we set the new value to
> > 10% with a delay of 1 second, and the bgwriter remembers the place it
> > stopped scanning the buffer cache, you will clean out the buffer cache
> > completely every 10 seconds.
>
> But we don't *want* it to clean out the buffer cache completely.

You are only cleaning it out in pieces over a 10-second period, so it is
getting dirty again as you go. You are not scanning the entire buffer
cache at one time.

> There's no point in writing a hot page every few seconds. So I don't
> think I believe in remembering where we stopped, anyway.

I was thinking that if you are doing this scanning every X milliseconds,
then after a while the front of the buffer cache will be mostly clean and
the end will be dirty, so you will always be going over the same early
ones to get to the later dirty ones. Remembering the location gives the
scan more uniform coverage of the buffer cache. You need a clock sweep
like BSD uses (and probably others).

> I think there's a reasonable case to be made for redefining
> bgwriter_percent as the max percent of the total buffer list to scan
> (not the max percent of the list to return --- Jan correctly pointed out
> that the latter is useless). Then we could modify StrategyDirtyBufferList
> so that the percent and maxpages parameters are passed in, and it can
> stop as soon as either one is satisfied. This would be a fairly
> small/safe code change, and I wouldn't have a problem doing it even at
> this late stage of the cycle.
>
> However ... we would have to crank up the default bgwriter_percent, and
> I don't know if we have any better idea what to set it to after such a
> change than we do now ...

Once we make the change we will have to get our testers working on it. We
need those figures to change over time based on backends doing writes, but
that isn't going to happen for 8.0.

--
Bruce Momjian
Re: [HACKERS] RC2 and open issues
Bruce Momjian <[EMAIL PROTECTED]> writes:
> You need a clock sweep like BSD uses (and probably others).

No, that's *fundamentally* wrong. The reason we are going to the trouble
of maintaining a complicated cache algorithm like ARC is so that we can
tell the heavily used pages from the lesser used ones. To throw away that
knowledge in favor of doing I/O with a plain clock sweep algorithm is just
wrong. What's more, I don't even understand what a clock sweep would mean
given that the ordering of the list is constantly changing.

			regards, tom lane
Re: [HACKERS] RC2 and open issues
Bruce Momjian <[EMAIL PROTECTED]> writes:
> I am confused. If we change the percentage to be X% of the entire buffer
> cache, and we set it to 1%, and we exit when either the dirty pages or %
> are reached, don't we end up just scanning the first 1% of the cache
> over and over again?

Exactly. But 1% would be uselessly small with this definition. Offhand I'd
think something like 50% might be a starting point; maybe even more. What
that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.

			regards, tom lane
Re: [HACKERS] RC2 and open issues
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > I am confused. If we change the percentage to be X% of the entire
> > buffer cache, and we set it to 1%, and we exit when either the dirty
> > pages or % are reached, don't we end up just scanning the first 1% of
> > the cache over and over again?
>
> Exactly. But 1% would be uselessly small with this definition. Offhand
> I'd think something like 50% might be a starting point; maybe even more.
> What that says is that a page isn't a candidate to be written out by the
> bgwriter until it's fallen halfway down the LRU list.

So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?

--
Bruce Momjian
Re: [HACKERS] RC2 and open issues
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
> > Exactly. But 1% would be uselessly small with this definition. Offhand
> > I'd think something like 50% might be a starting point; maybe even
> > more. What that says is that a page isn't a candidate to be written
> > out by the bgwriter until it's fallen halfway down the LRU list.
>
> So we are not scanning by buffer address but using the LRU list? Are we
> sure they are mostly dirty?

No. The entire point is to keep the LRU end of the list mostly clean.

Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely. That would be a fundamentally different way of envisioning what
the bgwriter is supposed to do, though. I think the main reason Jan didn't
try that was he wanted to be sure the LRU page was usually clean, so that
backends would seldom end up doing writes for themselves when they needed
to get a free buffer.

Maybe we need a hybrid approach: clean a few percent of the LRU end of the
ARC list in order to keep backends from blocking on writes, plus run a
clock scan to keep checkpoints from having to do much. But that's way
beyond what we have time for in the 8.0 cycle.

			regards, tom lane
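Tom's hybrid can be pictured as two independent target sets per bgwriter pass: a slice of the ARC LRU end (so backends rarely inherit a dirty buffer) plus an advancing clock window over the raw buffer array (so the next checkpoint stays small). A hypothetical sketch; the names, percentages, and step size are all illustrative.

```python
def hybrid_pass(arc_lru, nbuffers, clock_hand, lru_pct=2, clock_step=64):
    """One pass of a hybrid bgwriter: return the ARC buffers at the
    LRU end to clean, the buffer-array indexes the clock scan should
    visit, and the advanced clock hand. arc_lru[0] is the LRU end."""
    n_lru = max(len(arc_lru) * lru_pct // 100, 1)
    lru_targets = arc_lru[:n_lru]                  # few percent of LRU end
    clock_targets = [(clock_hand + i) % nbuffers   # clock window, wrapping
                     for i in range(clock_step)]
    return lru_targets, clock_targets, (clock_hand + clock_step) % nbuffers
```

The two halves answer the two goals Tom later names explicitly: the LRU slice serves StrategyGetBuffer callers, while the clock window pre-writes pages the checkpoint would otherwise have to flush.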
Re: [HACKERS] RC2 and open issues
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> > > Exactly. But 1% would be uselessly small with this definition.
> > > Offhand I'd think something like 50% might be a starting point;
> > > maybe even more. What that says is that a page isn't a candidate to
> > > be written out by the bgwriter until it's fallen halfway down the
> > > LRU list.
> >
> > So we are not scanning by buffer address but using the LRU list? Are
> > we sure they are mostly dirty?
>
> No. The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely. That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though. I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean, so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.
>
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus
> run a clock scan to keep checkpoints from having to do much. But that's
> way beyond what we have time for in the 8.0 cycle.

OK, so we scan from the end of the LRU. If we scan X% and find _no_ dirty
buffers, perhaps we should start where we left off last time. If we don't
start where we left off, I am thinking that if you do a lot of writes and
then do nothing, the next checkpoint would be huge, because a lot of the
LRU will be dirty because the bgwriter never got to it.

--
Bruce Momjian
Re: [HACKERS] RC2 and open issues
On Mon, 20 Dec 2004, Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> > > Exactly. But 1% would be uselessly small with this definition.
> > > Offhand I'd think something like 50% might be a starting point;
> > > maybe even more. What that says is that a page isn't a candidate to
> > > be written out by the bgwriter until it's fallen halfway down the
> > > LRU list.
> >
> > So we are not scanning by buffer address but using the LRU list? Are
> > we sure they are mostly dirty?
>
> No. The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely. That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though. I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean, so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.

Neil and I spoke with Jan briefly last week, and he mentioned a few
different approaches he'd been tossing over. Firstly, for alternative
runs, start X% in from the LRU, so that we aren't scanning clean buffers
all the time. Secondly, follow something like the approach you've
mentioned above but remember the offset: so, if we're scanning 10%, after
10 runs we will have written out all buffers.

I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2, and vice versa, would have on performance. I haven't
thought about this too hard, though, so it might be wrong-headed.

> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus
> run a clock scan to keep checkpoints from having to do much. But that's
> way beyond what we have time for in the 8.0 cycle.

Definitely.

Thanks,
Gavin
Re: [HACKERS] RC2 and open issues
Gavin Sherry wrote:
> Neil and I spoke with Jan briefly last week, and he mentioned a few
> different approaches he'd been tossing over. Firstly, for alternative
> runs, start X% in from the LRU, so that we aren't scanning clean buffers
> all the time. Secondly, follow something like the approach you've
> mentioned above but remember the offset: so, if we're scanning 10%,
> after 10 runs we will have written out all buffers.
>
> I was also thinking of benchmarking the effect of changing the algorithm
> in StrategyDirtyBufferList(): currently, for each iteration of the loop
> we read a buffer from each of T1 and T2. I was wondering what effect
> reading T1 first then T2, and vice versa, would have on performance. I
> haven't thought about this too hard, though, so it might be wrong-headed.

So we are all thinking in the same direction. We might have only a few
days to finalize this before final release.

--
Bruce Momjian
Re: [HACKERS] RC2 and open issues
Gavin Sherry <[EMAIL PROTECTED]> writes:
> I was also thinking of benchmarking the effect of changing the algorithm
> in StrategyDirtyBufferList(): currently, for each iteration of the loop
> we read a buffer from each of T1 and T2. I was wondering what effect
> reading T1 first then T2, and vice versa, would have on performance.

Looking at StrategyGetBuffer, it definitely seems like a good idea to try
to keep the bottom end of both T1 and T2 lists clean. But we should work
at T1 a bit harder.

The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write, and
try to keep the next upcoming checkpoint from having too much work to do.
Those are both laudable goals, but I hadn't really seen before that they
may require different strategies to achieve. I'm liking the idea that
bgwriter should alternate between doing writes in pursuit of the one goal
and doing writes in pursuit of the other.

			regards, tom lane