Re: [HACKERS] CommitDelay performance improvement
> I agree that 30k looks like the magic delay, and probably 30/5 would be a good conservative choice. But now that I think about the choice of number, I think it must vary with the speed of the machine and the length of the transactions; at 20tps, each TX is completing in around 50ms.

I think disk speed should probably be the main factor. After the first run, 30k/5 also seemed the best here, but running the test again shows that the results are only reproducible after a new initdb. Has anybody else seen reproducible results without a preceding initdb?

One thing I noticed is that WAL_FILES needs to be at least 4, because one run fills up to 3 logfiles, and we don't want to measure WAL formatting.

Andreas
Re: AW: [HACKERS] CommitDelay performance improvement
At 10:56 27/02/01 +0100, Zeugswetter Andreas SB wrote:
> > I agree that 30k looks like the magic delay, and probably 30/5 would be a good conservative choice. But now that I think about the choice of number, I think it must vary with the speed of the machine and the length of the transactions; at 20tps, each TX is completing in around 50ms.
> I think disk speed should probably be the main factor. After the first run, 30k/5 also seemed the best here, but running the test again shows that the results are only reproducible after a new initdb. Anybody else see reproducible results without a previous initdb?

I think we want something that reflects the chance of a time-saving as a result of a wait, which is why I suggested having each backend monitor commits/sec, then basing the delay on some % of that number. E.g. if commits/sec = 1, then it's either low load or long TXs; in either case CommitDelay won't help. Similarly, if we have 1000 commits/sec, then we have a very fast system and/or disk, and a CommitDelay of 10ms is clearly glacial.

AFAICS, dynamically monitoring commits/sec (or a similar statistic) is TOWTG, but in all cases we need to set a max on CommitDelay to prevent individual TXs getting too long (although I am unsure if the latter is *really* necessary, it is far better to be safe).

Note: commits/sec needs to be kept for each backend so we can remove the contribution of the backend that is considering waiting.

--
Philip Warner, Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82 | http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
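The rate-based policy described above can be sketched as a small simulation. Nothing below is PostgreSQL source; the function name, the 1 commit/sec "low-load" cutoff (taken from the example in the post), and the 100ms cap are illustrative assumptions:

```python
# Hypothetical sketch of a commit-rate-driven delay (not backend code).
# Rates are per-backend commits/sec; the returned delay is in microseconds.

MAX_DELAY_US = 100_000  # cap so no individual TX waits too long


def commit_delay_us(all_rates, my_rate, fraction=0.5):
    """Wait for a fraction of the mean inter-commit gap of *other* backends."""
    others = sum(all_rates) - my_rate   # remove our own contribution
    if others <= 1.0:
        return 0                        # low load or long TXs: delaying won't help
    gap_us = 1_000_000 / others         # mean time between other backends' commits
    return min(round(fraction * gap_us), MAX_DELAY_US)
```

On a system doing ~1000 other commits/sec the suggested delay collapses to half a millisecond, while a slow system hits the cap rather than sleeping for whole seconds.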
Re: [HACKERS] CommitDelay performance improvement
One thing that I remember from a performance test we once did is that the results are a lot more realistic, better, and more stable if you try to decouple the startup of the different clients a little bit, so they are not all in the same section of code at the same time. We inserted random usleeps; I forget what range, but 10 ms seems reasonable to me. This was another database, but it might also apply here.

Andreas
Re: [HACKERS] CommitDelay performance improvement
On Sun, Feb 25, 2001 at 12:41:28AM -0500, Tom Lane wrote:
> Attached are graphs from more thorough runs of pgbench with a commit delay that occurs only when at least N other backends are running active transactions. ... It's not entirely clear what set of parameters is best, but it is absolutely clear that a flat zero-commit-delay policy is NOT best.
> The test conditions are postmaster options -N 100 -B 1024, pgbench scale factor 10, pgbench -t (transactions per client) 100. (Hence the results for a single client rely on only 100 transactions, and are pretty noisy. The noise level should decrease as the number of clients increases.)

It's hard to interpret these results. In particular, "delay 10k, sibs 20" (10k,20), or cyan triangle, is almost the same as "delay 50k, sibs 1" (50k,1), or green X. Those are pretty different parameters to get such similar results.

The only really bad performers were (0), (10k,1), and (100k,20). The best were (30k,1) and (30k,10), although (30k,5) also did well except at 40. Why would 30k be a magic delay, regardless of siblings? What happened at 40?

At low loads, it seems (100k,1) (brown +) did best by far, which seems very odd. Even more odd, it did pretty well at very high loads but had problems at intermediate loads.

Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CommitDelay performance improvement
At 00:42 25/02/01 -0800, Nathan Myers wrote:
> The only really bad performers were (0), (10k,1), (100k,20). The best were (30k,1) and (30k,10), although (30k,5) also did well except at 40. Why would 30k be a magic delay, regardless of siblings? What happened at 40?

I had assumed that 40 was one of the glitches - it would be good if Tom (or someone else) could rerun the suite, to see if we see the same dip.

I agree that 30k looks like the magic delay, and probably 30/5 would be a good conservative choice. But now that I think about the choice of number, I think it must vary with the speed of the machine and the length of the transactions; at 20tps, each TX is completing in around 50ms. Probably the delay needs to be set at a value related to the average TX duration, and since that is not really a known figure, perhaps we should go with 30% of TX duration, with a max of 100k.

Alternatively, can PG monitor the commits/second, then set the delay to reflect half of the average TX time (or 100ms, whichever is smaller)? Is this too baroque?

--
Philip Warner, Albatross Consulting Pty. Ltd.
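The arithmetic proposed here (a delay of 30% of average TX duration, capped at 100k microseconds) can be sketched with the average TX time inferred from observed throughput. The function names are hypothetical; this is an illustration, not backend code:

```python
# Sketch of a TX-duration-derived commit delay (illustrative, not PostgreSQL).

MAX_DELAY_US = 100_000  # proposed hard maximum on the delay


def avg_tx_us(tps, clients):
    """Mean transaction duration implied by throughput: clients/tps seconds.

    E.g. 20 tps with one client means each TX completes in about 50ms,
    matching the figure quoted in the thread.
    """
    return 1_000_000 * clients / tps


def delay_from_tx_time(tps, clients, fraction=0.3):
    """30% of the average TX duration, never more than MAX_DELAY_US."""
    return min(round(fraction * avg_tx_us(tps, clients)), MAX_DELAY_US)
```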
RE: [HACKERS] CommitDelay performance improvement
-----Original Message-----
From: Tom Lane
> Attached are graphs from more thorough runs of pgbench with a commit delay that occurs only when at least N other backends are running active transactions. My initial try at this proved to be too noisy to tell much. The noise seems to be coming from WAL checkpoints that occur during a run and push down the reported TPS value for the particular case that's running. While we'd need to include WAL checkpoints to make an honest performance comparison against another RDBMS, I think they are best ignored for the purpose of figuring out what the commit-delay behavior ought to be. Accordingly, I modified my test script to minimize the occurrence of checkpoint activity during runs (see attached script). There are still some data points that are unexpectedly low compared to their neighbors; presumably these were affected by checkpoints or other system activity.
> It's not entirely clear what set of parameters is best, but it is absolutely clear that a flat zero-commit-delay policy is NOT best.
> The test conditions are postmaster options -N 100 -B 1024, pgbench scale factor 10, pgbench -t (transactions per client) 100. (Hence the results for a single client rely on only 100 transactions, and are pretty noisy. The noise level should decrease as the number of clients increases.)
> Comments anyone?

How about the case with scaling factor 1? I.e., could your proposal detect lock conflicts in reality? If so, I agree with your proposal.

BTW, there seems to be a misunderstanding about CommitDelay, i.e. that CommitDelay is completely a waste of time unless there's an overlap of commits. If other backends make use of the delay (CPU cycles), the delay is never a total waste of time.

Regards, Hiroshi Inoue
Re: [HACKERS] CommitDelay performance improvement
"Hiroshi Inoue" [EMAIL PROTECTED] writes: How about the case with scaling factor 1 ? i.e Could your proposal detect lock conflicts in reality ? The code is set up to not count backends that are waiting on locks. That is, to do a commit delay there must be at least N other backends that are in transactions, have written at least one XLOG entry in their transaction (so it's not a read-only xact and will need to write a commit record), and are not waiting on a lock. Is that what you meant? BTW there seems to be a misunderstanding about CommitDelay, i.e CommitDelay is completely a waste of time unless there's an overlap of commit. If other backends use the delay(cpu cycle) the delay is never a waste of time totally. Good point. In fact, if we measure only the total throughput in transactions per second then the commit delay will not appear to be hurting performance no matter how long it is, so long as other backends are in the RUN state for the whole delay. This suggests that pgbench should also measure the average transaction time seen by any one client. Is that a simple change? regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
Philip Warner [EMAIL PROTECTED] writes:
> At 00:42 25/02/01 -0800, Nathan Myers wrote:
> > The only really bad performers were (0), (10k,1), (100k,20). The best were (30k,1) and (30k,10), although (30k,5) also did well except at 40. Why would 30k be a magic delay, regardless of siblings? What happened at 40?
> I had assumed that 40 was one of the glitches - it would be good if Tom (or someone else) could rerun the suite, to see if we see the same dip.

Yes, I assumed the same. I posted the script; could someone else make the same run? We really need more than one test case ;-)

> I agree that 30k looks like the magic delay, and probably 30/5 would be a good conservative choice. But now that I think about the choice of number, I think it must vary with the speed of the machine and the length of the transactions; at 20tps, each TX is completing in around 50ms.

Yes, I think so too. This machine is able to do about 40 pgbench tr/sec single-client with fsync off, so the computational load is right about 25 msec per transaction. That's presumably why 30 msec looks like a good delay number. What interested me was that there doesn't seem to be a very sharp peak; anything from 10 to 100 msec yields fairly comparable results. This is a good thing ... if there *were* a sharp peak at the average xact length, tuning the delay parameter would be an impossible task in real-world cases where the transactions aren't all alike.

On the data so far, I'm inclined to go with 10k/5 as the default, so as not to risk wasting time with overly long delays on machines that are faster than this one. But we really need some data from other machines before deciding. It'd be nice to see some results with 10k delays too, from a machine where the kernel supports better-than-10msec delay resolution. Where's the Alpha contingent??

regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
[EMAIL PROTECTED] (Nathan Myers) writes:
> At low loads, it seems (100k,1) (brown +) did best by far, which seems very odd. Even more odd, it did pretty well at very high loads but had problems at intermediate loads.

In theory, all these variants should behave exactly the same for a single client, since there will be no commit delay in any of 'em in that case. I'm inclined to write off the aberrant result for 100k/1 as due to outside factors --- maybe the WAL file happened to be located in a particularly convenient place on the disk during that run, or some such. Since there's only 100 transactions in that test, it wouldn't take much to affect the result.

Likewise, the places where one mid-load datapoint is well below either neighbor are probably due to outside factors --- either a background WAL checkpoint or other activity on the machine, mail arrival for instance. I left the machine alone during the test, but I didn't bother to shut down the usual system services.

My feeling is that this test run tells us that zero commit delay is inferior to nonzero under these test conditions, but there's too much noise to pick out one of the nonzero-delay parameter combinations as being clearly better than the rest. (BTW, I did repeat the zero-delay series just to be sure it wasn't itself an outlier...)

regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
Tom Lane wrote:
> Philip Warner [EMAIL PROTECTED] writes:
> > I had assumed that 40 was one of the glitches - it would be good if Tom (or someone else) could rerun the suite, to see if we see the same dip.
> Yes, I assumed the same. I posted the script; could someone else make the same run? We really need more than one test case ;-)

I could find the script but seem to have missed your change about commit_siblings. Where could I get it?

Regards, Hiroshi Inoue
Re: [HACKERS] CommitDelay performance improvement
> Basically, I am not sure how much we lose by doing the delay after returning COMMIT, and I know we gain quite a bit by enabling us to group fsync calls. If included, this should be an option only, and not the default option.

Sure, it should never become the default, because the "D" in ACID just about forbids this kind of behaviour...

-- Dominique
Re: [HACKERS] CommitDelay performance improvement
On Sat, Feb 24, 2001 at 01:07:17AM -0500, Tom Lane wrote:
> [EMAIL PROTECTED] (Nathan Myers) writes:
> > I see, I had it backwards: N=0 corresponds to "always delay", and N=infinity (~0) is "never delay", or what you call zero delay. N=1 is not interesting. N=M/2 or N=sqrt(M) or N=log(M) might be interesting, where M is the number of backends, or the number of backends with begun transactions, or something. N=10 would be conservative (and maybe pointless) just because it would hardly ever trigger a delay.
> Why is N=1 not interesting? That requires at least one other backend to be in a transaction before you'll delay. That would seem to be the minimum useful value --- N=0 (always delay) seems clearly to be too stupid to be useful.

N=1 seems arbitrarily aggressive. It assumes any open transaction will commit within a few milliseconds; otherwise the delay is wasted. On a fairly busy system, it seems to me to impose a strict upper limit on the transaction rate for any client, regardless of actual system I/O load. (N=0 would impose that strict upper limit even for a single client.)

Delaying isn't free, because it means that the client can't turn around and do even a cheap query for a while. In a sense, when you delay you are charging the committer a tax to try to improve overall throughput. If the delay lets you reduce I/O churn enough to increase the total bandwidth, then it was worthwhile; if not, you just cut system performance, and responsiveness to each client, for nothing.

The above suggests that maybe N should depend on recent disk I/O activity, so you get a larger N (and thus less likely delay and more certain payoff) on a more lightly loaded system. On a system that has maxed out its I/O bandwidth, clients will suffer delays anyhow, so they might as well suffer controlled delays that result in better total throughput. On a lightly loaded system there's no need for, or payoff from, such throttling. Can we measure disk system load by averaging the times taken for fsyncs?
Nathan Myers [EMAIL PROTECTED]
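The closing question above, gauging disk load by averaging fsync times, could be sketched as an exponentially weighted average that widens N when the disk looks idle. Everything here (the class name, the 0.2 smoothing factor, the 50ms "busy" cutoff, the base N of 5) is an illustrative assumption, not PostgreSQL code:

```python
# Sketch: track recent fsync() durations and derive a sibling threshold N.
# A lightly loaded disk (fast fsyncs) yields a larger N, so delays are rarer
# and only triggered when a payoff is likely; a saturated disk yields N=1.

class FsyncMonitor:
    def __init__(self, alpha=0.2):
        self.alpha = alpha   # weight given to the newest sample
        self.avg = None      # smoothed fsync duration in seconds

    def record(self, duration):
        """Fold one observed fsync duration into the running average."""
        if self.avg is None:
            self.avg = duration
        else:
            self.avg = self.alpha * duration + (1 - self.alpha) * self.avg
        return self.avg

    def siblings_threshold(self, base_n=5, busy_fsync=0.05):
        """Small N (aggressive delaying) only when fsyncs are already slow."""
        if self.avg is not None and self.avg >= busy_fsync:
            return 1         # disk saturated: batch fsyncs aggressively
        return base_n        # disk idle: demand more siblings before waiting
```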
Re: [HACKERS] CommitDelay performance improvement
Attached are graphs from more thorough runs of pgbench with a commit delay that occurs only when at least N other backends are running active transactions.

My initial try at this proved to be too noisy to tell much. The noise seems to be coming from WAL checkpoints that occur during a run and push down the reported TPS value for the particular case that's running. While we'd need to include WAL checkpoints to make an honest performance comparison against another RDBMS, I think they are best ignored for the purpose of figuring out what the commit-delay behavior ought to be. Accordingly, I modified my test script to minimize the occurrence of checkpoint activity during runs (see attached script). There are still some data points that are unexpectedly low compared to their neighbors; presumably these were affected by checkpoints or other system activity.

It's not entirely clear what set of parameters is best, but it is absolutely clear that a flat zero-commit-delay policy is NOT best.

The test conditions are postmaster options -N 100 -B 1024, pgbench scale factor 10, pgbench -t (transactions per client) 100. (Hence the results for a single client rely on only 100 transactions, and are pretty noisy. The noise level should decrease as the number of clients increases.)

Comments anyone?

regards, tom lane

[Attachment: hppabench.gif]

#! /bin/sh
# Expected postmaster options: -N 100 -B 1024 -c checkpoint_timeout=1800
# Recommended pgbench setup: pgbench -i -s 10 bench
for del in 0 ; do
    for sib in 1 ; do
        for cli in 1 10 20 30 40 50 ; do
            echo "commit_delay = $del"
            echo "commit_siblings = $sib"
            psql -c "vacuum branches; vacuum tellers; delete from history; vacuum history; checkpoint;" bench
            PGOPTIONS="-c commit_delay=$del -c commit_siblings=$sib" \
                pgbench -c $cli -t 100 -n bench
        done
    done
done
for del in 1 3 5 10 ; do
    for sib in 1 5 10 20 ; do
        for cli in 1 10 20 30 40 50 ; do
            echo "commit_delay = $del"
            echo "commit_siblings = $sib"
            psql -c "vacuum branches; vacuum tellers; delete from history; vacuum history; checkpoint;" bench
            PGOPTIONS="-c commit_delay=$del -c commit_siblings=$sib" \
                pgbench -c $cli -t 100 -n bench
        done
    done
done
Re: [HACKERS] CommitDelay performance improvement
At 00:41 25/02/01 -0500, Tom Lane wrote:
> Comments anyone?

Don't suppose you could post the original data?

--
Philip Warner, Albatross Consulting Pty. Ltd.
Re: [HACKERS] CommitDelay performance improvement
Philip Warner [EMAIL PROTECTED] writes:
> Don't suppose you could post the original data?

Sure.

regards, tom lane

All runs are pgbench "TPC-B (sort of)", scaling factor 10, 100 transactions per client, with all transactions actually processed; each run is preceded by the vacuum/CHECKPOINT step from the script.

commit_delay  commit_siblings  clients  tps (incl. connections)  tps (excl. connections)
0             1                 1       10.996953                11.051216
0             1                10       17.779923                17.924390
0             1                20       17.289815                17.429343
0             1                30       17.292171                17.432905
0             1                40       17.733478                17.913251
0             1                50       18.325273                18.534556
1             1                 1       10.449347                10.500278
1             1                10       17.865721                18.015078
1             1                20       17.980234                18.131986
1             1                30       18.858489                19.027436
1             1                40       19.320221                19.496999
1             1                50       19.440978                19.621221
1             5                 1       11.298701                11.357102
1             5                10       19.722266                19.903373
1             5                20       19.042737                19.214042
commit_delay = 1 commit_siblings = 5
[HACKERS] CommitDelay performance improvement
Looking at the XLOG stuff, I notice that we already have a field (logRec) in the per-backend PROC structures that shows whether a transaction is currently in progress with at least one change made (ie at least one XLOG entry written). It would be very easy to extend the existing code so that the commit delay is not done unless there is at least one other backend with nonzero logRec --- or, more generally, at least N other backends with nonzero logRec. We cannot tell if any of them are actually nearing their commits, but this seems better than just blindly waiting. Larger values of N would presumably improve the odds that at least one of them is nearing its commit.

A further refinement, still quite cheap to implement since the info is in the PROC struct, would be to not count backends that are blocked waiting for locks. These guys are less likely to be ready to commit in the next few milliseconds than the guys who are actively running; indeed they cannot commit until someone else has committed/aborted to release the lock they need.

Comments? What should the threshold N be ... or do we need to make that a tunable parameter?

regards, tom lane
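The predicate described above can be written out as a small simulation. This is a sketch, not the actual PROC/XLOG code; the dict fields merely mimic the nonzero-logRec and lock-wait state the post mentions:

```python
# Sketch of the proposed commit-delay predicate (illustrative, not backend code).
# Each entry stands in for a per-backend PROC struct:
#   "logRec" nonzero = the backend has written at least one XLOG record,
#   "waiting" = the backend is blocked waiting for a lock.

def should_commit_delay(procs, me, commit_siblings):
    """Delay only if at least `commit_siblings` *other* backends are in a
    transaction that wrote WAL and are not blocked on a lock."""
    active_others = sum(
        1 for p in procs
        if p is not me and p["logRec"] != 0 and not p["waiting"]
    )
    return active_others >= commit_siblings
```

Lock-waiting backends are excluded because they cannot commit until someone else commits or aborts first, so counting them would only encourage wasted sleeps.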
Re: [HACKERS] CommitDelay performance improvement
> Bruce Momjian [EMAIL PROTECTED] writes:
> > Why not just set a flag in there when someone nears commit and clear it when they are about to commit?
> Define "nearing commit", in such a way that you can specify where you plan to set that flag.

Is there significant time between entry of CommitTransaction() and the fsync()? Maybe not.

--
Bruce Momjian | http://candle.pha.pa.us | (610) 853-3000
830 Blythe Avenue, Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, Christ can be your backup.
Re: [HACKERS] CommitDelay performance improvement
> Looking at the XLOG stuff, I notice that we already have a field (logRec) in the per-backend PROC structures that shows whether a transaction is currently in progress with at least one change made (ie at least one XLOG entry written). It would be very easy to extend the existing code so that the commit delay is not done unless there is at least one other backend with nonzero logRec --- or, more generally, at least N other backends with nonzero logRec. We cannot tell if any of them are actually nearing their commits, but this seems better than just blindly waiting. Larger values of N would presumably improve the odds that at least one of them is nearing its commit.

Why not just set a flag in there when someone nears commit, and clear it when they are about to commit?

-- Bruce Momjian
Re: [HACKERS] CommitDelay performance improvement
Bruce Momjian [EMAIL PROTECTED] writes:
> Why not just set a flag in there when someone nears commit and clear it when they are about to commit?

Define "nearing commit", in such a way that you can specify where you plan to set that flag.

regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
> Bruce Momjian [EMAIL PROTECTED] writes:
> > Is there significant time between entry of CommitTransaction() and the fsync()? Maybe not.
> I doubt it. No I/O anymore, anyway, unless the commit record happens to overrun an xlog block boundary.

That's what I was afraid of. Since we don't write the dirty blocks to the kernel anymore, we don't really have much happening before someone says they are about to commit. In the old days, we were write()'ing those buffers, and we had some delay and kernel calls in there. Guess that idea is dead.

-- Bruce Momjian
Re: [HACKERS] CommitDelay performance improvement
On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
> A further refinement, still quite cheap to implement since the info is in the PROC struct, would be to not count backends that are blocked waiting for locks. These guys are less likely to be ready to commit in the next few milliseconds than the guys who are actively running; indeed they cannot commit until someone else has committed/aborted to release the lock they need.
> Comments? What should the threshold N be ... or do we need to make that a tunable parameter?

Once you make it tuneable, you're stuck with it. You can always add a knob later, after somebody discovers a real need.

Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CommitDelay performance improvement
> On Fri, Feb 23, 2001 at 11:32:21AM -0500, Tom Lane wrote:
> > A further refinement, still quite cheap to implement since the info is in the PROC struct, would be to not count backends that are blocked waiting for locks. These guys are less likely to be ready to commit in the next few milliseconds than the guys who are actively running; indeed they cannot commit until someone else has committed/aborted to release the lock they need.
> > Comments? What should the threshold N be ... or do we need to make that a tunable parameter?
> Once you make it tuneable, you're stuck with it. You can always add a knob later, after somebody discovers a real need.

I wonder if Tom should implement it but leave it at zero until people can report that a non-zero value helps. We already have the parameter; we can just make it smarter and let people test it.

-- Bruce Momjian
Re: [HACKERS] CommitDelay performance improvement
[EMAIL PROTECTED] (Nathan Myers) writes:
> > Comments? What should the threshold N be ... or do we need to make that a tunable parameter?
> Once you make it tuneable, you're stuck with it. You can always add a knob later, after somebody discovers a real need.

If we had a good idea what the default level should be, I'd be willing to go without a knob. I'm thinking of a default of about 5 (ie, at least 5 other active backends to trigger a commit delay) ... but I'm not so confident of that that I think it needn't be tunable. It's really dependent on your average and peak transaction lengths, and that's going to vary across installations, so unless we want to try to make it self-adjusting, a knob seems like a good idea.

A self-adjusting delay might well be a great idea, BTW, but I'm trying to be conservative about how much complexity we should add right now.

regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
> [EMAIL PROTECTED] (Nathan Myers) writes:
> > Once you make it tuneable, you're stuck with it. You can always add a knob later, after somebody discovers a real need.
> If we had a good idea what the default level should be, I'd be willing to go without a knob. I'm thinking of a default of about 5 (ie, at least 5 other active backends to trigger a commit delay) ... but I'm not so confident of that that I think it needn't be tunable. It's really dependent on your average and peak transaction lengths, and that's going to vary across installations, so unless we want to try to make it self-adjusting, a knob seems like a good idea.
> A self-adjusting delay might well be a great idea, BTW, but I'm trying to be conservative about how much complexity we should add right now.

OH, so you are saying N backends should have dirtied buffers before doing the delay? Hmm, that seems almost untunable to me.

Let's suppose we decide to sleep. When we wake up, can we know that someone else has fsync'ed for us? And if they have, should we be more likely to fsync() in the future?

-- Bruce Momjian
Re: [HACKERS] CommitDelay performance improvement
> > And if they have, should we be more likely to fsync() in the future?

I meant more likely to sleep().

> You mean less likely. My thought for a self-adjusting delay was to ratchet the delay up a little every time it succeeds in avoiding an fsync, and down a little every time it fails to do so. No change when we don't delay at all (because of no other active backends). But testing this and making sure it behaves reasonably seems like more work than we should try to accomplish before 7.1.

It could be tough. Imagine the delay increasing to 3 seconds? Seems there has to be an upper bound on the sleep. The more you delay, the more likely you will be to find someone to fsync you. Are we waking processes up after we have fsync()'ed them? If so, we can keep increasing the delay.

-- Bruce Momjian
Re: [HACKERS] CommitDelay performance improvement
Bruce Momjian [EMAIL PROTECTED] writes:
> > A self-adjusting delay might well be a great idea, BTW, but I'm trying to be conservative about how much complexity we should add right now.
> OH, so you are saying N backends should have dirtied buffers before doing the delay? Hmm, that seems almost untunable to me. Let's suppose we decide to sleep. When we wake up, can we know that someone else has fsync'ed for us?

XLogFlush will find that it has nothing to do, so yes we can.

> And if they have, should we be more likely to fsync() in the future?

You mean less likely. My thought for a self-adjusting delay was to ratchet the delay up a little every time it succeeds in avoiding an fsync, and down a little every time it fails to do so. No change when we don't delay at all (because of no other active backends). But testing this and making sure it behaves reasonably seems like more work than we should try to accomplish before 7.1.

regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
Bruce Momjian <[EMAIL PROTECTED]> writes:
> It could be tough.  Imagine the delay increasing to 3 seconds?  Seems
> there has to be an upper bound on the sleep.  The more you delay, the
> more likely you will be to find someone to fsync you.

Good point, and an excellent illustration of the fact that
self-adjusting algorithms aren't that easy to get right the first
time ;-)

> Are we waking processes up after we have fsync()'ed them?

Not at the moment.  That would be another good mechanism to investigate
for 7.2; but right now there's no infrastructure that would allow a
backend to discover which other ones were sleeping for fsync.

                        regards, tom lane
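[Archive editor's note: the self-adjusting scheme discussed above — ratchet the delay up when a wait lets another backend's fsync cover us, down when it does not, with a hard cap so it cannot run away to Bruce's 3 seconds — can be sketched roughly as follows. This is an illustrative sketch only, not PostgreSQL source; all names and the step/cap values are assumptions.]

```python
MIN_DELAY_US = 0
MAX_DELAY_US = 100_000   # hard upper bound on the sleep (assumed value)
STEP_US = 1_000          # adjustment step (assumed value)

def adjust_commit_delay(delay_us, avoided_fsync, other_backends_active):
    """Return the new delay after one commit attempt.

    avoided_fsync: True if, on waking, XLogFlush found another backend
    had already fsync'ed past our commit record.
    """
    if not other_backends_active:
        return delay_us          # no delay was attempted; leave it alone
    if avoided_fsync:
        delay_us += STEP_US      # the wait paid off: lean further that way
    else:
        delay_us -= STEP_US      # wasted sleep: back off
    # clamp so a run of successes cannot push the delay unboundedly high
    return max(MIN_DELAY_US, min(MAX_DELAY_US, delay_us))
```

The clamp is the point Bruce raises: without it, a loaded system where waits keep succeeding would keep lengthening every transaction.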
Re: [HACKERS] CommitDelay performance improvement
On Fri, Feb 23, 2001 at 05:18:19PM -0500, Tom Lane wrote:
> [EMAIL PROTECTED] (Nathan Myers) writes:
> > > Comments?  What should the threshold N be ... or do we need to
> > > make that a tunable parameter?
> > Once you make it tunable, you're stuck with it.  You can always add
> > a knob later, after somebody discovers a real need.
>
> If we had a good idea what the default level should be, I'd be willing
> to go without a knob.  I'm thinking of a default of about 5 (ie, at
> least 5 other active backends to trigger a commit delay) ... but I'm
> not so confident of that that I think it needn't be tunable.  It's
> really dependent on your average and peak transaction lengths, and
> that's going to vary across installations, so unless we want to try to
> make it self-adjusting, a knob seems like a good idea.
>
> A self-adjusting delay might well be a great idea, BTW, but I'm trying
> to be conservative about how much complexity we should add right now.

When thinking about tuning N, I like to consider the interesting
possible values for N:

   0:  Ignore any other potential committers.
   1:  The minimum possible responsiveness to other committers.
   5:  Tom's guess for what might be a good choice.
  10:  Harry's guess.
  ~0:  Always delay.

I would rather release with N=1 than with 0, because it actually
responds to conditions.  What N might best be (>1, probably) varies on a
lot of hard-to-guess parameters.  It seems to me that comparing various
choices (and other, more interesting, algorithms) to the N=1 case would
be more productive than comparing them to the N=0 case, so releasing at
N=1 would yield better statistics for actually tuning in 7.2.

Nathan Myers
[EMAIL PROTECTED]
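[Archive editor's note: the threshold heuristic under discussion — count the other backends that are in active transactions and delay only if at least N of them exist — can be sketched like this. `Proc` is a toy stand-in for the real per-backend PROC struct; the field names are illustrative, not PostgreSQL's.]

```python
class Proc:
    """Toy per-backend entry. log_rec mimics the logRec field: set once
    the transaction has written at least one XLOG record."""
    def __init__(self, log_rec=None, blocked=False):
        self.log_rec = log_rec    # LSN of first XLOG record, or None
        self.blocked = blocked    # currently waiting on a lock?

def count_active_siblings(procs, me):
    # "active" = unblocked, with at least one XLOG record written,
    # and not ourselves (our own commit obviously doesn't count)
    return sum(1 for p in procs
               if p is not me and p.log_rec is not None and not p.blocked)

def should_commit_delay(procs, me, commit_siblings=5):
    """Delay before flushing only if enough other committers may show up."""
    return count_active_siblings(procs, me) >= commit_siblings
```

With `commit_siblings=5` (Tom's guessed default) a lightly loaded system never pays the delay; N=1 is the most eager setting that still "responds to conditions" in Nathan's sense.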
Re: [HACKERS] CommitDelay performance improvement
> When thinking about tuning N, I like to consider the interesting
> possible values for N:
>
>    0:  Ignore any other potential committers.
>    1:  The minimum possible responsiveness to other committers.
>    5:  Tom's guess for what might be a good choice.
>   10:  Harry's guess.
>   ~0:  Always delay.
>
> I would rather release with N=1 than with 0, because it actually
> responds to conditions.  What N might best be (>1, probably) varies on
> a lot of hard-to-guess parameters.  It seems to me that comparing
> various choices (and other, more interesting, algorithms) to the N=1
> case would be more productive than comparing them to the N=0 case, so
> releasing at N=1 would yield better statistics for actually tuning in
> 7.2.

We don't release code because it has better tuning opportunities for
later releases.  What we can do is give people parameters where the
default is safe, and they can play and report to us.
Re: [HACKERS] CommitDelay performance improvement
> > It could be tough.  Imagine the delay increasing to 3 seconds?
> > Seems there has to be an upper bound on the sleep.  The more you
> > delay, the more likely you will be to find someone to fsync you.
>
> Good point, and an excellent illustration of the fact that
> self-adjusting algorithms aren't that easy to get right the first
> time ;-)

I see.  I am concerned that anything done to 7.1 at this point may cause
problems with performance under certain circumstances.  Let's see what
the new code shows our testers.

> > Are we waking processes up after we have fsync()'ed them?
>
> Not at the moment.  That would be another good mechanism to
> investigate for 7.2; but right now there's no infrastructure that
> would allow a backend to discover which other ones were sleeping for
> fsync.

Can we put the backends to sleep waiting for a lock, and have them wake
up later?
Re: [HACKERS] CommitDelay performance improvement
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Can we put the backends to sleep waiting for a lock, and have them
> wake up later?

Locks don't have timeouts.  There is no existing mechanism that will
serve this purpose; we'll have to create a new one.

                        regards, tom lane
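[Archive editor's note: the "new mechanism" Tom says would be needed — backends sleep waiting for someone else's fsync, get woken by whichever backend performs it, with a timeout as a safety valve — is essentially a condition variable with a timed wait. A toy model, not proposed PostgreSQL code; names are illustrative:]

```python
import threading

class FsyncWaiters:
    """Backends sleep here until the WAL has been flushed past their
    commit record, or until a timeout expires (so a sleeper can never
    be stranded if no one else commits)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._flushed_upto = 0   # highest LSN known durably flushed

    def wait_for_flush(self, my_lsn, timeout_s):
        """Return True if someone else's fsync covered us in time."""
        with self._cond:
            self._cond.wait_for(lambda: self._flushed_upto >= my_lsn,
                                timeout=timeout_s)
            return self._flushed_upto >= my_lsn

    def note_flush(self, upto_lsn):
        """Called by the backend that just fsync'ed: wake all sleepers."""
        with self._cond:
            self._flushed_upto = max(self._flushed_upto, upto_lsn)
            self._cond.notify_all()
```

A caller whose `wait_for_flush` returns False simply does its own fsync, so the timeout bounds the added commit latency exactly as the commit-delay cap would.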
Re: [HACKERS] CommitDelay performance improvement
On Fri, Feb 23, 2001 at 06:37:06PM -0500, Bruce Momjian wrote:
> > I would rather release with N=1 than with 0, because it actually
> > responds to conditions.  [...]  It seems to me that comparing
> > various choices (and other, more interesting, algorithms) to the N=1
> > case would be more productive than comparing them to the N=0 case,
> > so releasing at N=1 would yield better statistics for actually
> > tuning in 7.2.
>
> We don't release code because it has better tuning opportunities for
> later releases.  What we can do is give people parameters where the
> default is safe, and they can play and report to us.

Perhaps I misunderstood.  I had perceived N=1 as a conservative choice
that was nevertheless preferable to N=0.

Nathan Myers
[EMAIL PROTECTED]
Re: [HACKERS] CommitDelay performance improvement
> > > It seems to me that comparing various choices (and other, more
> > > interesting, algorithms) to the N=1 case would be more productive
> > > than comparing them to the N=0 case, so releasing at N=1 would
> > > yield better statistics for actually tuning in 7.2.
> > We don't release code because it has better tuning opportunities for
> > later releases.  What we can do is give people parameters where the
> > default is safe, and they can play and report to us.
>
> Perhaps I misunderstood.  I had perceived N=1 as a conservative choice
> that was nevertheless preferable to N=0.

I think zero delay is the conservative choice at this point, unless we
hear otherwise from testers.
Re: [HACKERS] CommitDelay performance improvement
> > Can we put the backends to sleep waiting for a lock, and have them
> > wake up later?
>
> Locks don't have timeouts.  There is no existing mechanism that will
> serve this purpose; we'll have to create a new one.

That is what I suspected.  Having thought about it, we currently have a
few options:

        1) let every backend fsync on its own
        2) try to delay backends so they all fsync() at the same time
        3) delay fsync until after commit

Items 2 and 3 attempt to bunch up fsyncs.  Option 2 has backends waiting
to fsync() on the expectation that some other backend may commit soon.
Option 3 may turn out to be the best solution.  No matter how smart we
make the code, we will never know for sure if someone is about to commit
and whether it is worth waiting.

My idea would be to let committing backends return "COMMIT" to the user,
and set a need_fsync flag that is guaranteed to cause an fsync within X
milliseconds.  This way, if other backends commit in the next X
milliseconds, they can all use one fsync().

Now, I know many will complain that we are returning commit while not
having the stuff on the platter.  But consider, we only lose data from
an OS crash or hardware failure.  Do people who commit something, and
then the machine crashes 2 milliseconds after the commit, really expect
the data to be on the disk when they restart?  Maybe they do, but it
seems the benefit of grouped fsyncs() is large enough that many will say
they would rather have this option.

This was my point long ago that we could offer sub-second reliability
with no-fsync performance if we just had some process running that wrote
dirty pages and fsynced every 20 milliseconds.
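[Archive editor's note: option 3 above — acknowledge COMMIT immediately, then let a periodic background flusher perform one fsync covering every commit that landed in the window — can be sketched as below. A toy model only; `do_fsync` stands in for whatever actually flushes the WAL file, and all names are illustrative.]

```python
import threading

class GroupFsync:
    """Commits set need_fsync and return at once; a background process
    calls tick() every interval, so all commits that arrive within one
    window share a single fsync."""

    def __init__(self, do_fsync, interval_s=0.020):
        self.do_fsync = do_fsync      # e.g. lambda: os.fsync(wal_fd)
        self.interval_s = interval_s  # the "X milliseconds" window
        self._lock = threading.Lock()
        self._need_fsync = False
        self.fsync_count = 0          # for observing the batching effect

    def commit(self):
        # return COMMIT without waiting; just record that a flush is owed
        with self._lock:
            self._need_fsync = True
        return "COMMIT"

    def tick(self):
        # called by the background flusher once per interval_s
        with self._lock:
            if not self._need_fsync:
                return                # nothing committed this window
            self._need_fsync = False
        self.do_fsync()               # one fsync covers the whole batch
        self.fsync_count += 1
```

The trade-off is exactly the one debated in this thread: transactions acknowledged in the last window (up to `interval_s`) are lost on a crash, in exchange for one fsync per window instead of one per commit.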
Re: [HACKERS] CommitDelay performance improvement
At 21:31 23/02/01 -0500, Bruce Momjian wrote:
> Now, I know many will complain that we are returning commit while not
> having the stuff on the platter.

You're definitely right there.

> Maybe they do, but it seems the benefit of grouped fsyncs() is large
> enough that many will say they would rather have this option.

I'd prefer to wait for a lock manager that supports timeouts and
contention notification.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd.  (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81        Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
Re: [HACKERS] CommitDelay performance improvement
> > Maybe they do, but it seems the benefit of grouped fsyncs() is large
> > enough that many will say they would rather have this option.
>
> I'd prefer to wait for a lock manager that supports timeouts and
> contention notification.

I would understand if that was going to fix the problem completely, but
it isn't.  It is just going to allow us more flexibility at guessing who
may be about to commit.
Re: [HACKERS] CommitDelay performance improvement
At 11:32 23/02/01 -0500, Tom Lane wrote:
> Looking at the XLOG stuff, I notice that we already have a field
> (logRec) in the per-backend PROC structures that shows whether a
> transaction is currently in progress with at least one change made (ie
> at least one XLOG entry written).

Would it be worth adding a field 'waiting for fsync since xxx', so the
second process can (a) log that it is expecting someone else to fsync
(for perf stats, if we want them), and (b) wait for (xxx + delta) ms/us,
etc?
Re: [HACKERS] CommitDelay performance improvement
At 23:14 23/02/01 -0500, Bruce Momjian wrote:
> There is one more thing.  Even though the kernel says the data is on
> the platter, it still may not be there.

This is true, but it does not mean we should say 'the disk is slightly
unreliable, so we can be too'.  Also, IIRC, the last time this was
discussed, someone commented that buying expensive disks and a UPS gets
you reliability (barring a direct lightning strike) - it had something
to do with write-ordering and hardware caches.  In any case, I'd hate to
see DB design decisions based closely on hardware capability.  At least
two of my customers use high performance ram disks for databases - do
these also suffer from 'flush is not really flush' problems?

> Basically, I am not sure how much we lose by doing the delay after
> returning COMMIT, and I know we gain quite a bit by enabling us to
> group fsync calls.

If included, this should be an option only, and not the default option.
In fact I'd quite like to see such a feature, although I'd not only do a
'flush every X ms', but I'd also do a 'flush every X transactions' -
this way a DBA can say 'I don't mind losing the last 20 TXs in a crash'.
Bear in mind that on a fast system, 20ms is a lot of transactions.
Re: [HACKERS] CommitDelay performance improvement
On Fri, Feb 23, 2001 at 09:05:20PM -0500, Bruce Momjian wrote:
> > > It seems to me that comparing various choices (and other, more
> > > interesting, algorithms) to the N=1 case would be more productive
> > > than comparing them to the N=0 case, so releasing at N=1 would
> > > yield better statistics for actually tuning in 7.2.
> > Perhaps I misunderstood.  I had perceived N=1 as a conservative
> > choice that was nevertheless preferable to N=0.
>
> I think zero delay is the conservative choice at this point, unless we
> hear otherwise from testers.

I see, I had it backwards: N=0 corresponds to "always delay", and
N=infinity (~0) is "never delay", or what you call zero delay.

N=1 is not interesting.  N=M/2 or N=sqrt(M) or N=log(M) might be
interesting, where M is the number of backends, or the number of
backends with begun transactions, or something.  N=10 would be
conservative (and maybe pointless) just because it would hardly ever
trigger a delay.

Nathan Myers
[EMAIL PROTECTED]
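[Archive editor's note: Nathan's candidate formulas for deriving the threshold N from M, the number of backends with begun transactions, look like this in a toy sketch. The function names are ours, the log base is an assumption (he does not specify one), and each is floored at 1 so the threshold never degenerates to "always delay".]

```python
import math

def n_half(m):
    # N = M/2: delay only when half the active backends might commit soon
    return max(1, m // 2)

def n_sqrt(m):
    # N = sqrt(M): grows slowly, so big installations still delay readily
    return max(1, round(math.sqrt(m)))

def n_log(m):
    # N = log(M): base-2 assumed here; nearly flat even at high load
    return max(1, round(math.log2(m))) if m > 1 else 1
```

For M=100 backends these give thresholds of 50, 10, and 7 respectively, which shows how differently the three formulas scale with load.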
Re: [HACKERS] CommitDelay performance improvement
> This is true, but it does not mean we should say 'the disk is slightly
> unreliable, so we can be too'.  Also, IIRC, the last time this was
> discussed, someone commented that buying expensive disks and a UPS
> gets you reliability (barring a direct lightning strike) - it had
> something to do with write-ordering and hardware caches.  In any case,
> I'd hate to see DB design decisions based closely on hardware
> capability.  At least two of my customers use high performance ram
> disks for databases - do these also suffer from 'flush is not really
> flush' problems?

Well, I am saying we are being pretty rigid here when we may be on top
of a system that is not, meaning that our rigidity is buying us little.

> If included, this should be an option only, and not the default
> option.  In fact I'd quite like to see such a feature, although I'd
> not only do a 'flush every X ms', but I'd also do a 'flush every X
> transactions' - this way a DBA can say 'I don't mind losing the last
> 20 TXs in a crash'.  Bear in mind that on a fast system, 20ms is a lot
> of transactions.

Yes, I can see this as a good option for many users.  My old complaint
was that we allowed only two very extreme options: fsync() all the time,
or fsync() never and recover from a crash.
Re: [HACKERS] CommitDelay performance improvement
> > My idea would be to let committing backends return "COMMIT" to the
> > user, and set a need_fsync flag that is guaranteed to cause an fsync
> > within X milliseconds.  This way, if other backends commit in the
> > next X milliseconds, they can all use one fsync().
>
> Guaranteed by what?  We have no mechanism available to make an fsync
> happen while the backend is waiting for input.

We would need a separate binary that can look at shared memory and fsync
if someone requested it.  Again, nothing for 7.1.X.

> > Now, I know many will complain that we are returning commit while
> > not having the stuff on the platter.
>
> I think that's unacceptable on its face.  A remote client may take
> action on the basis that COMMIT was returned.  If the server then
> crashes, the client is unlikely to realize this for some time
> (certainly at least one TCP timeout interval).  It won't look like a
> "milliseconds later" situation to that client.  In fact, the client
> might *never* realize there was a problem; what if it disconnects
> after getting the COMMIT?
>
> If the dbadmin thinks he doesn't need fsync before commit, he'll
> likely be running with fsync off anyway.  For the ones who do think
> they need fsync, I don't believe that we get to rearrange the fsync to
> occur after commit.

I can see someone wanting some fsync, but not willing to take the hit.
My argument is that having this ability, there would be no need to turn
off fsync.
Re: [HACKERS] CommitDelay performance improvement
Bruce Momjian <[EMAIL PROTECTED]> writes:
> My idea would be to let committing backends return "COMMIT" to the
> user, and set a need_fsync flag that is guaranteed to cause an fsync
> within X milliseconds.  This way, if other backends commit in the next
> X milliseconds, they can all use one fsync().

Guaranteed by what?  We have no mechanism available to make an fsync
happen while the backend is waiting for input.

> Now, I know many will complain that we are returning commit while not
> having the stuff on the platter.

I think that's unacceptable on its face.  A remote client may take
action on the basis that COMMIT was returned.  If the server then
crashes, the client is unlikely to realize this for some time (certainly
at least one TCP timeout interval).  It won't look like a "milliseconds
later" situation to that client.  In fact, the client might *never*
realize there was a problem; what if it disconnects after getting the
COMMIT?

If the dbadmin thinks he doesn't need fsync before commit, he'll likely
be running with fsync off anyway.  For the ones who do think they need
fsync, I don't believe that we get to rearrange the fsync to occur after
commit.

                        regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
> > Now, I know many will complain that we are returning commit while
> > not having the stuff on the platter.
>
> You're definitely right there.
>
> > Maybe they do, but it seems the benefit of grouped fsyncs() is large
> > enough that many will say they would rather have this option.
>
> I'd prefer to wait for a lock manager that supports timeouts and
> contention notification.

There is one more thing.  Even though the kernel says the data is on the
platter, it still may not be there.  Some OS's may return from fsync
when the data is _queued_ to the disk, rather than actually waiting for
the drive return code to say it completed.  Second, some disks report
back that the data is on the disk when it is actually in the disk's
memory buffer, not really on the disk.

Basically, I am not sure how much we lose by doing the delay after
returning COMMIT, and I know we gain quite a bit by enabling us to group
fsync calls.
Re: [HACKERS] CommitDelay performance improvement
Philip Warner <[EMAIL PROTECTED]> writes:
> It may have been much earlier in the debate, but has anyone checked to
> see what the maximum possible gains might be - or is it self-evident
> to people who know the code?

fsync off provides an upper bound to the speed achievable from being
smarter about when to fsync... I doubt that fsync-once-per-checkpoint
would be much different.

                        regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
Preliminary results from experimenting with an
N-transactions-must-be-running-to-cause-commit-delay heuristic are
attached.  It seems to be a pretty definite win.  I'm currently running
a more extensive set of cases on another machine for comparison.

The test case is pgbench, unmodified, but run at scale factor 10 to
reduce write contention on the 'branch' rows.  Postmaster parameters are
-N 100 -B 1024 in all cases.  The fsync-off (with, of course, no commit
delay either) case is shown for comparison.

"commit siblings" is the number of other backends that must be running
active (unblocked, at least one XLOG entry made) transactions before we
will do a precommit delay.  commit delay=1 is effectively a 10 msec
delay on this hardware.  Interestingly, it seems that we can push the
delay up to two or three clock ticks without degradation, given a
positive N.

                        regards, tom lane

hppabench.gif
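[Archive editor's note: the commit path being benchmarked above combines the two knobs — sleep only when at least `commit_siblings` other active transactions exist, then skip the fsync entirely if another backend's flush already covered our commit record. A toy sketch of that flow; `Wal` and all names are illustrative, not PostgreSQL source.]

```python
import time

class Wal:
    """Minimal stand-in for shared WAL flush state."""
    def __init__(self):
        self.flushed_upto = 0   # highest LSN known to be on disk
        self.fsyncs = 0
    def flush(self, upto):
        self.flushed_upto = max(self.flushed_upto, upto)
        self.fsyncs += 1        # one real fsync() would happen here

def commit_record_flush(my_lsn, wal, active_siblings,
                        commit_siblings=5, commit_delay_us=10_000):
    """Return True if we had to fsync ourselves, False if it was avoided."""
    # precommit delay: nap only when enough other active transactions
    # exist that one of them may flush past our commit record for us
    if commit_delay_us > 0 and active_siblings >= commit_siblings:
        time.sleep(commit_delay_us / 1_000_000)
    # after the nap, another backend's flush may already cover us
    if wal.flushed_upto >= my_lsn:
        return False            # XLogFlush finds nothing to do
    wal.flush(my_lsn)           # write + fsync WAL through our record
    return True
```

Counting how often this returns False under concurrent load is essentially what the pgbench comparison of siblings/delay settings is measuring.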
Re: [HACKERS] CommitDelay performance improvement
[EMAIL PROTECTED] (Nathan Myers) writes:
> I see, I had it backwards: N=0 corresponds to "always delay", and
> N=infinity (~0) is "never delay", or what you call zero delay.
>
> N=1 is not interesting.  N=M/2 or N=sqrt(M) or N=log(M) might be
> interesting, where M is the number of backends, or the number of
> backends with begun transactions, or something.  N=10 would be
> conservative (and maybe pointless) just because it would hardly ever
> trigger a delay.

Why is N=1 not interesting?  That requires at least one other backend to
be in a transaction before you'll delay.  That would seem to be the
minimum useful value --- N=0 (always delay) seems clearly to be too
stupid to be useful.

                        regards, tom lane
Re: [HACKERS] CommitDelay performance improvement
> > It may have been much earlier in the debate, but has anyone checked
> > to see what the maximum possible gains might be - or is it
> > self-evident to people who know the code?
>
> fsync off provides an upper bound to the speed achievable from being
> smarter about when to fsync... I doubt that fsync-once-per-checkpoint
> would be much different.

That was my point: people should be doing fsync once per checkpoint
rather than never.