Re: [HACKERS] checkpointer continuous flushing
>> To emphasize potential bad effects without having to build too large a host and involve too many table spaces, I would suggest significantly reducing the "checkpoint_flush_after" setting while running these tests.
>
> Meh, that completely distorts the test.

Yep, I agree. The point would be to show whether there is a significant impact, or not, with less hardware & cost involved in the test.

Now if you can put 16 disks with 16 table spaces and 16 buffers per bucket, that is good, fine with me! I'm just trying to point out that you could probably get comparable relative results with 4 disks, 4 table spaces and 4 buffers per bucket, so it is an alternative and less expensive testing strategy. This just shows that I usually work on a tight (negligible?) budget :-)

-- Fabien.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
> My impression is that we actually know what we need to know anyway?

Sure, the overall summary is "it is much better with the patch" on this large SSD test, which is good news because the patch was really designed to help with HDDs.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
>> You took 5% of the tx on two 12-hour runs, totaling say 85M tx on one and 100M tx on the other, so you get 4.25M tx from the first and 5M from the second.
>
> OK
>
>> I'm saying that the percentile should be computed on the largest one (5M), so that you get a curve like the following, with both curves having the same transaction density on the y axis, so the second one does not go up to the top, reflecting that in this case fewer transactions were processed.
>
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't quite understand what that would tell us.

It would tell us, for a given transaction number (in the latency-ordered list), whether its latency is above or below that of the other run. I think it would probably show that the latency is always better for the patched version, by getting rid of the crossing, which has no meaning and seems to suggest, wrongly, that in some cases the other is better than the first; but as the y axes of the two curves are not in the same unit (not the same transaction density), this is just an illusion implied by a misplaced normalization. So I'm basically saying that the y axis should be just the transaction number, not a percentage.

Anyway, these are just details; your figures show that the patch is a very significant win on SSDs, all is well!

-- Fabien.
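For illustration, the normalization being proposed here can be sketched in a few lines of Python (toy latency data, not the actual benchmark logs): rank every transaction in latency order, then divide both runs' ranks by the size of the *larger* sample, so the slower run's curve is cut short below 100%.

```python
def latency_ranks(latencies_ms):
    """Sorted (latency, rank) pairs: rank = number of tx at or below that latency."""
    return [(lat, rank + 1) for rank, lat in enumerate(sorted(latencies_ms))]

# Toy runs: run_b completed more transactions in the same wall-clock time.
run_a = [2.0, 3.0, 5.0, 9.0]        # slower run, 4 tx
run_b = [2.5, 3.0, 3.5, 4.0, 4.5]   # faster run, 5 tx

# Normalize both curves by the *larger* transaction count, so the y axes
# share the same "transaction density": the faster run spans 0..1, the
# slower one is cut short at 4/5 = 0.8, reflecting the lost throughput.
scale = max(len(run_a), len(run_b))
curve_a = [(lat, rank / scale) for lat, rank in latency_ranks(run_a)]
curve_b = [(lat, rank / scale) for lat, rank in latency_ranks(run_b)]
```

With a per-run normalization both curves would end at 1.0 and could cross; with the shared scale the slower run simply tops out lower.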
Re: [HACKERS] checkpointer continuous flushing
On 2016-03-22 10:52:55 +0100, Fabien COELHO wrote:
> To emphasize potential bad effects without having to build too large a host and involve too many table spaces, I would suggest significantly reducing the "checkpoint_flush_after" setting while running these tests.

Meh, that completely distorts the test.
Re: [HACKERS] checkpointer continuous flushing
> WRT tablespaces: what I'm planning to do, unless somebody has a better proposal, is to basically rent two big Amazon instances, and run pgbench in parallel over N tablespaces. Once with local SSD and once with local HDD storage.

Ok. I am not sure how to check that the table spaces are actually on distinct dedicated disks with VMs, but this is the idea.

To emphasize potential bad effects without having to build too large a host and involve too many table spaces, I would suggest significantly reducing the "checkpoint_flush_after" setting while running these tests.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
On 2016-03-22 10:48:20 +0100, Tomas Vondra wrote:
> Hi,
>
> On 03/22/2016 10:44 AM, Fabien COELHO wrote:
>>>>> 1) regular-latency.png
>>>>
>>>> I'm wondering whether it would be clearer if the percentiles were relative to the largest sample, not to itself, so that the figures from the largest one would still be between 0 and 1, but the other (unpatched) one would go between 0 and 0.85, that is, would be cut short proportionally to the actual performance.
>>>
>>> I'm not sure what you mean by 'relative to largest sample'?
>>
>> You took 5% of the tx on two 12-hour runs, totaling say 85M tx on one and 100M tx on the other, so you get 4.25M tx from the first and 5M from the second.
>
> OK
>
>> I'm saying that the percentile should be computed on the largest one (5M), so that you get a curve like the following, with both curves having the same transaction density on the y axis, so the second one does not go up to the top, reflecting that in this case fewer transactions were processed.
>
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't quite understand what that would tell us.

My impression is that we actually know what we need to know anyway?
Re: [HACKERS] checkpointer continuous flushing
Hi,

On 03/22/2016 10:44 AM, Fabien COELHO wrote:
>>>> 1) regular-latency.png
>>>
>>> I'm wondering whether it would be clearer if the percentiles were relative to the largest sample, not to itself, so that the figures from the largest one would still be between 0 and 1, but the other (unpatched) one would go between 0 and 0.85, that is, would be cut short proportionally to the actual performance.
>>
>> I'm not sure what you mean by 'relative to largest sample'?
>
> You took 5% of the tx on two 12-hour runs, totaling say 85M tx on one and 100M tx on the other, so you get 4.25M tx from the first and 5M from the second.

OK

> I'm saying that the percentile should be computed on the largest one (5M), so that you get a curve like the following, with both curves having the same transaction density on the y axis, so the second one does not go up to the top, reflecting that in this case fewer transactions were processed.

Huh, that seems weird. That's not how percentiles or CDFs work, and I don't quite understand what that would tell us.

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] checkpointer continuous flushing
>>> 1) regular-latency.png
>>
>> I'm wondering whether it would be clearer if the percentiles were relative to the largest sample, not to itself, so that the figures from the largest one would still be between 0 and 1, but the other (unpatched) one would go between 0 and 0.85, that is, would be cut short proportionally to the actual performance.
>
> I'm not sure what you mean by 'relative to largest sample'?

You took 5% of the tx on two 12-hour runs, totaling say 85M tx on one and 100M tx on the other, so you get 4.25M tx from the first and 5M from the second.

I'm saying that the percentile should be computed on the largest one (5M), so that you get a curve like the following, with both curves having the same transaction density on the y axis, so the second one does not go up to the top, reflecting that in this case fewer transactions were processed.

[ASCII sketch: two latency curves sharing the same transaction density on the y axis; one goes up to 100%, the other is cut short below it]

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi,

On 2016-03-21 18:46:58 +0100, Tomas Vondra wrote:
> I've repeated the tests, but this time logged details for 5% of the transactions (instead of aggregating the data for each second). I've also made the tests shorter - just 12 hours instead of 24, to reduce the time needed to complete the benchmark.
>
> Overall, this means ~300M transactions in total for the un-throttled case, so a sample with ~15M transactions was available when computing the following charts.
>
> I've used the same commits as during the previous testing, i.e. a298a1e0 (before patches) and 23a27b03 (with patches).
>
> One interesting difference is that while the "patched" version resulted in slightly better performance (8122 vs. 8000 tps), the "unpatched" version got considerably slower (6790 vs. 7725 tps) - that's ~13% difference, so not negligible. Not sure what the cause is - the configuration was exactly the same, there's nothing in the log, and the machine was dedicated to the testing. The only explanation I have is that the unpatched code is a bit more unstable when it comes to this type of stress testing.
>
> The results (including scripts for generating the charts) are here:
>
> https://github.com/tvondra/flushing-benchmark-2
>
> Attached are three charts - again, those are using CDFs to illustrate the distributions and compare them easily:
>
> 1) regular-latency.png
>
> The two curves intersect at ~4ms, where both CDFs reach ~85%. For the shorter transactions, the old code is slightly faster (i.e. apparently there's some per-transaction overhead). For higher latencies though, the patched code is clearly winning - there are far fewer transactions over 6ms, which makes a huge difference. (Notice the x-axis is actually log-scale, so the tail on the old code is actually much longer than it might appear.)
>
> 2) throttled-latency.png
>
> In the throttled case (i.e. when the system is not 100% utilized, so it's more representative of actual production use), the difference is quite clearly in favor of the new code.
>
> 3) throttled-schedule-lag.png
>
> Mostly just an alternative view on the previous chart, showing how much later the transactions were scheduled. Again, the new code is winning.

Thanks for running these tests! I think this shows that we're in good shape, and that the commits succeeded in what they were attempting. Very glad to hear that.

WRT tablespaces: what I'm planning to do, unless somebody has a better proposal, is to basically rent two big Amazon instances, and run pgbench in parallel over N tablespaces. Once with local SSD and once with local HDD storage.

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing
Hi,

On 03/22/2016 07:35 AM, Fabien COELHO wrote:
> Hello Tomas,
>
> Thanks again for these interesting benches.
>
>> Overall, this means ~300M transactions in total for the un-throttled case, so a sample with ~15M transactions was available when computing the following charts.
>
> Still a very sizable run!
>
>> The results (including scripts for generating the charts) are here:
>>
>> https://github.com/tvondra/flushing-benchmark-2
>
> This repository seems empty.

Strange. Apparently I forgot to push, or maybe it did not complete before I closed the terminal. Anyway, pushing now (it'll take a bit more time to complete).

>> 1) regular-latency.png
>
> I'm wondering whether it would be clearer if the percentiles were relative to the largest sample, not to itself, so that the figures from the largest one would still be between 0 and 1, but the other (unpatched) one would go between 0 and 0.85, that is, would be cut short proportionally to the actual performance.

I'm not sure what you mean by 'relative to largest sample'?

>> The two curves intersect at ~4ms, where both CDFs reach ~85%. For the shorter transactions, the old code is slightly faster (i.e. apparently there's some per-transaction overhead).
>
> I'm not sure how meaningful the crossing is, because the two curves do not reflect the same performance. I think that they may not cross at all if the normalization is with the same reference, i.e. the better run.

Well, I think the curves illustrate exactly the performance difference, because with the old code the percentiles after p=0.85 get much higher. Which is the point of the crossing, although I agree the exact point does not have a particular meaning.

>> 2) throttled-latency.png
>>
>> In the throttled case (i.e. when the system is not 100% utilized, so it's more representative of actual production use), the difference is quite clearly in favor of the new code.
>
> Indeed, it is a no-brainer.

Yep.

>> 3) throttled-schedule-lag.png
>>
>> Mostly just an alternative view on the previous chart, showing how much later the transactions were scheduled. Again, the new code is winning.
>
> No-brainer again. I infer from this figure that with the initial version 60% of transactions have trouble being processed on time, while this is maybe about 35% with the new version.

Yep.

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] checkpointer continuous flushing
Hello Tomas,

Thanks again for these interesting benches.

> Overall, this means ~300M transactions in total for the un-throttled case, so a sample with ~15M transactions was available when computing the following charts.

Still a very sizable run!

> The results (including scripts for generating the charts) are here:
>
> https://github.com/tvondra/flushing-benchmark-2

This repository seems empty.

> 1) regular-latency.png

I'm wondering whether it would be clearer if the percentiles were relative to the largest sample, not to itself, so that the figures from the largest one would still be between 0 and 1, but the other (unpatched) one would go between 0 and 0.85, that is, would be cut short proportionally to the actual performance.

> The two curves intersect at ~4ms, where both CDFs reach ~85%. For the shorter transactions, the old code is slightly faster (i.e. apparently there's some per-transaction overhead).

I'm not sure how meaningful the crossing is, because the two curves do not reflect the same performance. I think that they may not cross at all if the normalization is with the same reference, i.e. the better run.

> 2) throttled-latency.png
>
> In the throttled case (i.e. when the system is not 100% utilized, so it's more representative of actual production use), the difference is quite clearly in favor of the new code.

Indeed, it is a no-brainer.

> 3) throttled-schedule-lag.png
>
> Mostly just an alternative view on the previous chart, showing how much later the transactions were scheduled. Again, the new code is winning.

No-brainer again. I infer from this figure that with the initial version 60% of transactions have trouble being processed on time, while this is maybe about 35% with the new version.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi,

I've repeated the tests, but this time logged details for 5% of the transactions (instead of aggregating the data for each second). I've also made the tests shorter - just 12 hours instead of 24, to reduce the time needed to complete the benchmark.

Overall, this means ~300M transactions in total for the un-throttled case, so a sample with ~15M transactions was available when computing the following charts.

I've used the same commits as during the previous testing, i.e. a298a1e0 (before patches) and 23a27b03 (with patches).

One interesting difference is that while the "patched" version resulted in slightly better performance (8122 vs. 8000 tps), the "unpatched" version got considerably slower (6790 vs. 7725 tps) - that's ~13% difference, so not negligible. Not sure what the cause is - the configuration was exactly the same, there's nothing in the log, and the machine was dedicated to the testing. The only explanation I have is that the unpatched code is a bit more unstable when it comes to this type of stress testing.

The results (including scripts for generating the charts) are here:

https://github.com/tvondra/flushing-benchmark-2

Attached are three charts - again, those are using CDFs to illustrate the distributions and compare them easily:

1) regular-latency.png

The two curves intersect at ~4ms, where both CDFs reach ~85%. For the shorter transactions, the old code is slightly faster (i.e. apparently there's some per-transaction overhead). For higher latencies though, the patched code is clearly winning - there are far fewer transactions over 6ms, which makes a huge difference. (Notice the x-axis is actually log-scale, so the tail on the old code is actually much longer than it might appear.)

2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized, so it's more representative of actual production use), the difference is quite clearly in favor of the new code.

3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how much later the transactions were scheduled. Again, the new code is winning.

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
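For reference, the sample-then-plot pipeline described here can be sketched in a few lines of Python (toy data; the field layout assumes pgbench's per-transaction log, where the third whitespace-separated column is the latency in microseconds - adjust the index if your pgbench version logs a different layout):

```python
import random

def sample_latencies(log_lines, fraction=0.05, seed=0):
    """Keep a random fraction of per-transaction log lines, return sorted latencies.

    Assumes the third field of each line is the latency in microseconds,
    as in pgbench's per-transaction log format.
    """
    rng = random.Random(seed)
    return sorted(int(line.split()[2])
                  for line in log_lines if rng.random() < fraction)

def cdf(sorted_latencies):
    """(latency, cumulative fraction) pairs, ready for plotting as a CDF."""
    n = len(sorted_latencies)
    return [(v, (i + 1) / n) for i, v in enumerate(sorted_latencies)]

# Tiny fabricated log (client_id tx_no latency_us ...), sampled at 100%
# here just to keep the example deterministic.
toy_log = ["0 %d %d 0 0 0" % (i, 1000 + i) for i in range(100)]
latencies = sample_latencies(toy_log, fraction=1.0)
points = cdf(latencies)
```

With ~300M transactions, a 5% sample still leaves ~15M points, plenty for a smooth curve.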
Re: [HACKERS] checkpointer continuous flushing
Hello Tomas,

Thanks for these great measures.

> * 4 x CPU E5-4620 (2.2GHz)

4*8 = 32 cores / 64 threads.

> * 256GB of RAM

Wow!

> * 24x SSD on LSI 2208 controller (with 1GB BBWC)

Wow! RAID configuration? The patch is designed to fix very big issues on HDDs, but it is good to see that the impact is good on SSDs as well. Is it possible to run tests with distinct table spaces on those many disks?

> * shared_buffers=64GB

1/4 of the available memory. The pgbench was scale 6, so ~750GB of data on disk, 3x the available memory, mostly on disk.

> or like this ("throttled"):
>
> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>
> The reason for the throttling is that people generally don't run production databases 100% saturated, so it'd be sad to improve the 100% saturated case and hurt the common case by increasing latency.

Sure.

> The machine does ~8000 tps, so 5000 tps is ~60% of that.

Ok.

I would have suggested using the --latency-limit option to filter out very slow queries; otherwise, if the system is stuck, it may catch up later, but then this is not representative of "sustainable" performance. When pgbench is running under a target rate, in both runs the transaction distribution is expected to be the same, around 5000 tps, and the green run looks pretty ok with respect to that. The magenta one shows that about 25% of the time things are not good at all, and the higher figures just show the catching up, which is not really interesting if you asked for a web page and it is finally delivered 1 minute later.

> * regular-tps.png (per-second TPS) [...]

Great curves!

> [...] the chart shows that the performance is way more consistent. Originally there was ~10% of samples with ~2000 tps, but with the flushing you'd have to go to ~4600 tps. It's actually pretty difficult to determine this from the chart, because the curve got so steep and I had to check the data used to generate the charts. Similarly for the upper end, but I assume that's a consequence of the throttling not having to compensate for the "slow" seconds anymore.

Yep, but they should be filtered out, "sorry, too late", so that would count as unresponsiveness, at least for a large class of applications.

Thanks a lot for these interesting tests!

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi,

On 03/17/2016 10:14 PM, Fabien COELHO wrote:
>>> I would have suggested using the --latency-limit option to filter out very slow queries; otherwise, if the system is stuck, it may catch up later, but then this is not representative of "sustainable" performance. When pgbench is running under a target rate, in both runs the transaction distribution is expected to be the same, around 5000 tps, and the green run looks pretty ok with respect to that. The magenta one shows that about 25% of the time things are not good at all, and the higher figures just show the catching up, which is not really interesting if you asked for a web page and it is finally delivered 1 minute later.
>>
>> Maybe. But that'd only increase the stress on the system, possibly causing more issues, no? And the magenta line is the old code, thus it would only increase the improvement of the new code.
>
> Yes and no. I agree that it stresses the system a little more, but the fact that you have 5000 tps in the end does not show that you can really sustain 5000 tps with reasonable latency. I find this latter information more interesting than knowing that you can get 5000 tps on average, thanks to some catching up. Moreover the non-throttled runs already showed that the system could do 8000 tps, so the bandwidth is already there.

Sure, but thanks to the tps charts we *do know* that for the vast majority of the intervals (each second) the number of completed transactions is very close to 5000. And that wouldn't be possible if a large part of the latencies were close to the maximums. With 5000 tps and 32 clients, the average latency should be less than ~6.4ms, otherwise the clients couldn't make ~160 tps each. But we do see that the maximum latency for most intervals is way higher. Only ~10% of the intervals have max latency below 10ms, for example.

>> Notice the max latency is in microseconds (as logged by pgbench), so according to the "max latency" charts the latencies are below 10 seconds (old) and 1 second (new) about 99% of the time.
>
> AFAICS, the max latency is aggregated by second, but then it does not say much about the distribution of individual latencies in the interval, that is, whether they were all close to the max or not. Having the same chart with the median or average might help. Also, the percentiles on the stddev chart do not correspond with those on the latency chart, so it may be that the latency is high but the stddev is low, i.e. all transactions are equally bad in the interval, or not.
>
> So I must admit that I'm not clear at all on how to interpret the max latency & stddev charts you provided.

You're right, those charts are not describing distributions of the latencies but those aggregated metrics. And it's not particularly simple to deduce information about the source statistics, for example because all the intervals have the same "weight" although the number of transactions that completed in each interval may be different.

But I do think it's a very useful tool when it comes to measuring the consistency of behavior over time, assuming you're asking questions about the intervals and not the original transactions. For example, had there been intervals with vastly different transaction rates, we'd see that on the tps charts (i.e. the chart would be much more gradual or wobbly, just like the "unpatched" one). Or if there were intervals with much higher variance of latencies, we'd see that on the STDDEV chart.

I'll consider repeating the benchmark and logging some reasonable sample of transactions - for the 24h run the unthrottled benchmark did ~670M transactions. Assuming ~30B per line, that's ~20GB, so a 5% sample should be ~1GB of data, which I think is enough. But of course, that's useful for answering questions about the distribution of the individual latencies globally, not about their consistency over time.

>> So I don't think this would make any measurable difference in practice.
>
> I think that it may show that 25% of the time the system could not match the target tps, even if it can handle much more on average, so the tps achieved when discarding late transactions would be under 4000 tps.

You mean the 'throttled-tps' chart? Yes, that one shows that without the patches there are a lot of intervals where the tps was much lower - presumably due to a lot of slow transactions.

regards

-- Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
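To illustrate why the per-interval aggregates are hard to interpret (hypothetical numbers, not from the benchmark): two one-second intervals can share the same max latency while having completely different latency distributions, and only the stddev (or the raw sample) separates them.

```python
import statistics

# Two hypothetical one-second intervals (latencies in ms):
one_outlier = [2, 2, 2, 2, 100]          # one slow transaction in a fast second
all_bad     = [100, 100, 100, 100, 100]  # every transaction equally slow

# Per-interval max is identical, so a "max latency" chart cannot tell
# these apart; the population stddev does (roughly 39.2 vs 0.0).
max_a, max_b = max(one_outlier), max(all_bad)
sd_a = round(statistics.pstdev(one_outlier), 1)
sd_b = statistics.pstdev(all_bad)
```

This is exactly the ambiguity discussed above: a high max with a low stddev means all transactions in the interval were equally bad, while a high max with a high stddev points to a few outliers.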
Re: [HACKERS] checkpointer continuous flushing
Hello Tomas,

> But I do think it's a very useful tool when it comes to measuring the consistency of behavior over time, assuming you're asking questions about the intervals and not the original transactions.

For a throttled run, I think it is better to check whether or not the system could handle the load "as expected", i.e. with reasonable latency; so somehow I'm interested in the "original transactions" as scheduled by the client, and whether they were processed efficiently, but then it must be aggregated by interval to get some statistics.

> For example, had there been intervals with vastly different transaction rates, we'd see that on the tps charts (i.e. the chart would be much more gradual or wobbly, just like the "unpatched" one). Or if there were intervals with much higher variance of latencies, we'd see that on the STDDEV chart.

On HDDs what happens is that transactions are "blocked/frozen": the tps is very low and the latency very high, but then with few tx (even 1 or 0 at a time), all with latencies that are very bad but nevertheless close to one another, in a bad way, the resulting stddev may be quite small anyway.

> I'll consider repeating the benchmark and logging some reasonable sample of transactions

Beware that this measure is skewed, because on HDDs when the system is stuck it is stuck on very few transactions which are waiting, but they would seldom show up in the statistics as there are very few of them. That is why I'm interested in those that could not make it, hence my interest in the --latency-limit option, which says just that.

>>> So I don't think this would make any measurable difference in practice.
>>
>> I think that it may show that 25% of the time the system could not match the target tps, even if it can handle much more on average, so the tps achieved when discarding late transactions would be under 4000 tps.
>
> You mean the 'throttled-tps' chart?

Yes.

> Yes, that one shows that without the patches, there's a lot of intervals where the tps was much lower - presumably due to a lot of slow transactions.

Yep. That is what would be measured with the latency-limit option, by counting the dropped transactions that were not processed in a timely manner.

-- Fabien.
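The accounting discussed here can be sketched as a post-processing step (toy numbers; pgbench's actual --latency-limit counts and skips late transactions at run time rather than after the fact):

```python
def effective_tps(latencies_ms, duration_s, limit_ms):
    """tps counting only transactions served within the latency limit."""
    on_time = [l for l in latencies_ms if l <= limit_ms]
    dropped = len(latencies_ms) - len(on_time)
    return len(on_time) / duration_s, dropped

# Toy 10-second run: 50 tx completed (5 tps on average), but 20 of them
# only finished during a catch-up burst, ~0.9s late.
latencies = [5.0] * 30 + [900.0] * 20
tps, dropped = effective_tps(latencies, duration_s=10, limit_ms=100)
# The average says 5 tps, but only 3 tps were served in a timely manner.
```

The point is that the average rate hides the catch-up: the same log yields a lower "sustainable" tps once late transactions are discarded.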
Re: [HACKERS] checkpointer continuous flushing
>> Is it possible to run tests with distinct table spaces on those many disks?
>
> Nope, that'd require reconfiguring the system (and then back), and I don't have access to that system (just SSH).

Ok.

> Also, I don't quite see what that would tell us?

Currently the flushing context is shared between table spaces, but I think that it should be per table space. My tests did not manage to convince Andres, so getting some more figures would be great. That will be another time!

>> I would have suggested using the --latency-limit option to filter out very slow queries; otherwise, if the system is stuck, it may catch up later, but then this is not representative of "sustainable" performance. When pgbench is running under a target rate, in both runs the transaction distribution is expected to be the same, around 5000 tps, and the green run looks pretty ok with respect to that. The magenta one shows that about 25% of the time things are not good at all, and the higher figures just show the catching up, which is not really interesting if you asked for a web page and it is finally delivered 1 minute later.
>
> Maybe. But that'd only increase the stress on the system, possibly causing more issues, no? And the magenta line is the old code, thus it would only increase the improvement of the new code.

Yes and no. I agree that it stresses the system a little more, but the fact that you have 5000 tps in the end does not show that you can really sustain 5000 tps with reasonable latency. I find this latter information more interesting than knowing that you can get 5000 tps on average, thanks to some catching up. Moreover the non-throttled runs already showed that the system could do 8000 tps, so the bandwidth is already there.

> Notice the max latency is in microseconds (as logged by pgbench), so according to the "max latency" charts the latencies are below 10 seconds (old) and 1 second (new) about 99% of the time.

AFAICS, the max latency is aggregated by second, but then it does not say much about the distribution of individual latencies in the interval, that is, whether they were all close to the max or not. Having the same chart with the median or average might help. Also, the percentiles on the stddev chart do not correspond with those on the latency chart, so it may be that the latency is high but the stddev is low, i.e. all transactions are equally bad in the interval, or not.

So I must admit that I'm not clear at all on how to interpret the max latency & stddev charts you provided.

> So I don't think this would make any measurable difference in practice.

I think that it may show that 25% of the time the system could not match the target tps, even if it can handle much more on average, so the tps achieved when discarding late transactions would be under 4000 tps.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi,

On 03/11/2016 02:34 AM, Andres Freund wrote:
> Hi,
>
> I just pushed the two major remaining patches in this thread. Let's see what the buildfarm has to say; I'd not be surprised if there's some lingering portability problem in the flushing code.
>
> There's one remaining issue we definitely want to resolve before the next release: right now we always use one writeback context across all tablespaces in a checkpoint, but Fabien's testing shows that that's likely to hurt in a number of cases. I have some data suggesting the contrary in others.
>
> Things that'd be good:
>
> * Some benchmarking. Right now controlled flushing is enabled by default on linux, but disabled by default on other operating systems. Somebody running benchmarks on e.g. freebsd or OSX might be good.

So I've done some benchmarks of this, and I think the results are very good. I've compared a298a1e06 and 23a27b039d (so the two patches mentioned here are in-between those two), and I've done a few long pgbench runs - 24h each:

1) master (a298a1e06), regular pgbench
2) master (a298a1e06), throttled to 5000 tps
3) patched (23a27b039), regular pgbench
4) patched (23a27b039), throttled to 5000 tps

All of this was done on a quite large machine:

* 4 x CPU E5-4620 (2.2GHz)
* 256GB of RAM
* 24x SSD on LSI 2208 controller (with 1GB BBWC)

The page cache was using the default config, although in production setups we'd probably lower the limits (particularly the background threshold):

* vm.dirty_background_ratio = 10
* vm.dirty_ratio = 20

The main PostgreSQL configuration changes are these:

* shared_buffers = 64GB
* bgwriter_delay = 10ms
* bgwriter_lru_maxpages = 1000
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* min_wal_size = 32GB

I haven't touched the flush_after values, so those are at the defaults.

Full config in the github repo, along with all the results and scripts used to generate the charts etc.:

https://github.com/tvondra/flushing-benchmark

I'd like to see some benchmarks on machines with regular rotational storage, but I don't have a suitable system at hand.

The pgbench was scale 6, so ~750GB of data on disk, and was executed either like this (the "default"):

pgbench -c 32 -j 8 -T 86400 -l --aggregate-interval=1 pgbench

or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run production databases 100% saturated, so it'd be sad to improve the 100% saturated case and hurt the common case by increasing latency. The machine does ~8000 tps, so 5000 tps is ~60% of that.

It's difficult to judge based on a single run (although a long one), but it seems the throughput increased a tiny bit, from 7725 to 8000. That's ~4% difference, but I guess more runs would be needed to see whether this is noise or an actual improvement.

Now, let's look at the per-second results, i.e. how much the performance fluctuates over time (due to checkpoints etc.). That's where the aggregated (per-second) log gets useful, as it's used for generating the various charts for tps, max latency, stddev of latency etc.

All those charts are CDFs, i.e. cumulative distribution functions: they plot a metric on the x-axis and the probability P(X <= x) on the y-axis. In general, the steeper the curve the better (more consistent behavior over time). It also allows comparing two curves - e.g. for the tps metric the "lower" curve is better, as it means higher values are more likely.

default (non-throttled) pgbench runs
------------------------------------

Let's see the regular (non-throttled) pgbench runs first:

* regular-tps.png (per-second TPS)

Clearly, the patched version is much more consistent - firstly it's much less "wobbly", and secondly it's considerably steeper, which means the per-second throughput fluctuates much less. That's good.
We already know the total throughput is almost exactly the same (just 4% difference); this also shows that the medians are almost exactly the same (the curves intersect at pretty much exactly 50%). * regular-max-lat.png (per-second maximum latency) * regular-stddev-lat.png (per-second latency stddev) Apparently the additional processing slightly increases both the maximum latency and standard deviation, as the green line (patched) is consistently below the pink one (unpatched). Notice however that the x-axis is using log scale, so the differences are actually very small, and we also know that the total throughput slightly increased. So while those two metrics slightly increased, the overall impact on latency has to be positive. throttled pgbench runs -- * throttled-tps.png (per-second TPS) OK, this is great - the chart shows that the performance is way more consistent. Originally there were ~10% of samples with ~2000 tps, but with the flushing you'd have to go to ~4600 tps. It's actually pretty difficult to determine this from the chart, because the curve got so steep and I had to
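The CDF charts discussed above can be rebuilt from pgbench's aggregated log in a few lines. A minimal sketch, assuming the 9.5-era aggregation format where each `--aggregate-interval=1` line starts with `interval_start num_of_transactions` (the sample lines below are hypothetical, and the `...` stands for the latency-sum/min/max fields not used here):

```python
# Build an empirical CDF of per-second tps from pgbench's aggregated log:
# take the transaction count of each 1s bucket, sort, and plot
# x = tps value, y = P(X <= x). A steeper curve means more consistent tps.
def empirical_cdf(values):
    """Return (x, y) point lists of the empirical CDF of `values`."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def tps_per_second(log_lines):
    """Extract per-second transaction counts (field 2) from aggregated log lines."""
    return [int(line.split()[1]) for line in log_lines]

sample_log = [
    "1457650000 4980 ...",  # hypothetical aggregated-log lines
    "1457650001 2100 ...",
    "1457650002 5050 ...",
]
xs, ys = empirical_cdf(tps_per_second(sample_log))
```

Comparing two runs then amounts to plotting both (xs, ys) curves on one chart; for tps the "lower" curve is the better run, as stated above.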
Re: [HACKERS] checkpointer continuous flushing
Hi, On 03/17/2016 06:36 PM, Fabien COELHO wrote: Hello Tomas, Thanks for these great measures. * 4 x CPU E5-4620 (2.2GHz) 4*8 = 32 cores / 64 threads. Yep. I only used 32 clients though, to keep some of the CPU available for the rest of the system (also, HT does not really double the number of cores). * 256GB of RAM Wow! * 24x SSD on LSI 2208 controller (with 1GB BBWC) Wow! RAID configuration? The patch is designed to fix very big issues on HDD, but it is good to see that the impact is good on SSD as well. Yep, RAID-10. I agree that doing the test on a HDD-based system would be useful, however (a) I don't have a comparable system at hand at the moment, and (b) I was a bit worried that it'll hurt performance on SSDs, but thankfully that's not the case. I will do the test on a much smaller system with HDDs in a few days. Is it possible to run tests with distinct table spaces on those many disks? Nope, that'd require reconfiguring the system (and then back), and I don't have access to that system (just SSH). Also, I don't quite see what would that tell us? * shared_buffers=64GB 1/4 of the available memory. The pgbench was scale 6, so ~750GB of data on disk, ~3x the available memory, mostly on disk. or like this ("throttled"): pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench The reason for the throttling is that people generally don't run production databases 100% saturated, so it'd be sad to improve the 100% saturated case and hurt the common case by increasing latency. Sure. The machine does ~8000 tps, so 5000 tps is ~60% of that. Ok. I would have suggested using the --latency-limit option to filter out very slow queries, otherwise if the system is stuck it may catch up later, but then this is not representative of "sustainable" performance. When pgbench is running under a target rate, in both runs the transaction distribution is expected to be the same, around 5000 tps, and the green run looks pretty ok with respect to that. 
The magenta one shows that about 25% of the time, things are not good at all, and the higher figures just show the catching up, which is not really interesting if you asked for a web page and it is finally delivered 1 minute later. Maybe. But that'd only increase the stress on the system, possibly causing more issues, no? And the magenta line is the old code, thus it would only increase the improvement of the new code. Notice the max latency is in microseconds (as logged by pgbench), so according to the "max latency" charts the latencies are below 10 seconds (old) and 1 second (new) about 99% of the time. So I don't think this would make any measurable difference in practice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] checkpointer continuous flushing - V18
On 3/13/16 6:30 PM, Peter Geoghegan wrote: On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes wrote: Would the wiki be a good place for such tips? Not as formal as the documentation, and more centralized (and editable) than a collection of blog posts. That general direction makes sense, but I'm not sure if the Wiki is something that this will work for. I fear that it could become something like the TODO list page: a page that contains theoretically accurate information, but isn't very helpful. The TODO list needs to be heavily pruned, but that seems like something that will never happen. A centralized location for performance tips will probably only work well if there are still high standards that are actively enforced. There still needs to be tight editorial control. I think there's ways to significantly restrict who can edit a page, so this could probably still be done via the wiki. IMO we should also be encouraging users to test various tips and provide feedback, so maybe a wiki page with a big fat request at the top asking users to submit any feedback about the page to -performance. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] checkpointer continuous flushing - V18
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes wrote: > Would the wiki be a good place for such tips? Not as formal as the > documentation, and more centralized (and editable) than a collection > of blog posts. That general direction makes sense, but I'm not sure if the Wiki is something that this will work for. I fear that it could become something like the TODO list page: a page that contains theoretically accurate information, but isn't very helpful. The TODO list needs to be heavily pruned, but that seems like something that will never happen. A centralized location for performance tips will probably only work well if there are still high standards that are actively enforced. There still needs to be tight editorial control. -- Peter Geoghegan
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan wrote: > On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO wrote: >> I can only concur! >> >> The "Performance Tips" chapter (II.14) is more user/query oriented. The >> "Server Administration" book (III) does not discuss this much. > > That's definitely one area in which the docs are lacking -- I've heard > several complaints about this myself. I think we've been hesitant to > do more in part because the docs must always be categorically correct, > and must not use weasel words. I think it's hard to talk about > performance while maintaining the general tone of the documentation. I > don't know what can be done about that. Would the wiki be a good place for such tips? Not as formal as the documentation, and more centralized (and editable) than a collection of blog posts. Cheers, Jeff
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO wrote: > I can only concur! > > The "Performance Tips" chapter (II.14) is more user/query oriented. The > "Server Administration" book (III) does not discuss this much. That's definitely one area in which the docs are lacking -- I've heard several complaints about this myself. I think we've been hesitant to do more in part because the docs must always be categorically correct, and must not use weasel words. I think it's hard to talk about performance while maintaining the general tone of the documentation. I don't know what can be done about that. -- Peter Geoghegan
Re: [HACKERS] checkpointer continuous flushing
I just pushed the two major remaining patches in this thread. Hurray! Nine months to get this baby out :-) -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
As you wish. I thought that understanding the underlying performance model with sequential writes written in chunks is important for the admin, and as this guc would have an impact on performance it should be hinted about, including the limits of its effect where large bases will converge to random io performance. But maybe that is not the right place. I do agree that that's something interesting to document somewhere. But I don't think any of the current places in the documentation are a good fit, and it's a topic much more general than the feature we're debating here. I'm not volunteering, but a good discussion of storage and the interactions with postgres surely would be a significant improvement to the postgres docs. I can only concur! The "Performance Tips" chapter (II.14) is more user/query oriented. The "Server Administration" book (III) does not discuss this much. There is a wiki about performance tuning, but it is not integrated into the documentation. It could be a first documentation source. Also the READMEs in some development directories are very interesting, although they contain too many details about the implementation. There have been a lot of presentations over the years, and blog posts. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi, I just pushed the two major remaining patches in this thread. Let's see what the buildfarm has to say; I'd not be surprised if there's some lingering portability problem in the flushing code. There's one remaining issue we definitely want to resolve before the next release: Right now we always use one writeback context across all tablespaces in a checkpoint, but Fabien's testing shows that that's likely to hurt in a number of cases. I've some data suggesting the contrary in others. Things that'd be good: * Some benchmarking. Right now controlled flushing is enabled by default on linux, but disabled by default on other operating systems. Somebody running benchmarks on e.g. freebsd or OSX might be good. * If somebody has the energy to provide a windows implementation for flush control, that might be worthwhile. There's several places that could benefit from that. * The default values are basically based on benchmarking by me and Fabien. Regards, Andres
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote: > As you wish. I thought that understanding the underlying performance model > with sequential writes written in chunks is important for the admin, and as > this guc would have an impact on performance it should be hinted about, > including the limits of its effect where large bases will converge to random > io performance. But maybe that is not the right place. I do agree that that's something interesting to document somewhere. But I don't think any of the current places in the documentation are a good fit, and it's a topic much more general than the feature we're debating here. I'm not volunteering, but a good discussion of storage and the interactions with postgres surely would be a significant improvement to the postgres docs. - Andres
Re: [HACKERS] checkpointer continuous flushing - V18
[...] If the default is in pages, maybe you could state it and afterwards translate it in size. Hm, I think that's more complicated for users than it's worth. As you wish. I liked the number of pages you used initially because it really gives a hint of how much random IOs are avoided when they are contiguous, and I just do not have the same intuition with sizes. Also it is related to the io queue length managed by the OS. The text could say something about sequential writes performance because pages are sorted.., but that it is lost for large bases and/or short checkpoints ? I think that's an implementation detail. As you wish. I thought that understanding the underlying performance model with sequential writes written in chunks is important for the admin, and as this guc would have an impact on performance it should be hinted about, including the limits of its effect where large bases will converge to random io performance. But maybe that is not the right place. -- Fabien
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres, I'm not sure I've seen these performance... If you have hard evidence, please feel free to share it. Man, are you intentionally trying to be hard to work with? Sorry, I do not understand this remark. You were referring to some latency measures in your answer, and I was just stating that I was interested in seeing these figures which were used to justify your choice to keep a shared writeback context. I did not intend this wish to be an issue, I was expressing an interest. To quote the email you responded to: My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail. Good. From the evidence in the thread, I would have given the per tablespace context the preference, but this is just a personal opinion and I agree that it can work the other way around. I look forward to seeing these benchmarks later on, when you have them. So all is well, and hopefully will be even better later on. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote: > > > > >Whenever more than bgwriter_flush_after bytes have > >been written by the bgwriter, attempt to force the OS to issue these > >writes to the underlying storage. Doing so will limit the amount of > >dirty data in the kernel's page cache, reducing the likelihood of > >stalls when an fsync is issued at the end of a checkpoint, or when > >the OS writes data back in larger batches in the background. Often > >that will result in greatly reduced transaction latency, but there > >also are some cases, especially with workloads that are bigger than > >, but smaller than the OS's page > >cache, where performance might degrade. This setting may have no > >effect on some platforms. 0 disables controlled > >writeback. The default is 256Kb on Linux, 0 > >otherwise. This parameter can only be set in the > >postgresql.conf file or on the server command line. > > > > > >(plus adjustments for the other gucs) > What about the maximum value? Added. bgwriter_flush_after (int) bgwriter_flush_after configuration parameter Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than , but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. The valid range is between 0, which disables controlled writeback, and 2MB. The default is 256Kb on Linux, 0 elsewhere. (Non-default values of BLCKSZ change the default and maximum.) This parameter can only be set in the postgresql.conf file or on the server command line. 
> If the default is in pages, maybe you could state it and afterwards > translate it in size. Hm, I think that's more complicated for users than it's worth. > The text could say something about sequential writes performance because > pages are sorted.., but that it is lost for large bases and/or short > checkpoints ? I think that's an implementation detail. - Andres
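The doc text above gives the setting in bytes (default 256Kb on Linux, maximum 2MB) while Fabien's suggestion is to also state it in pages, since the GUC is internally a count of BLCKSZ pages. A quick sanity check of that byte-to-page mapping, assuming the default 8kB BLCKSZ (the "Non-default values of BLCKSZ change the default and maximum" caveat above):

```python
# Convert the documented flush_after byte values into BLCKSZ pages,
# which is the unit the writeback context actually works in.
BLCKSZ = 8192                             # default PostgreSQL block size
DEFAULT_FLUSH_AFTER_BYTES = 256 * 1024    # "256Kb on Linux"
MAX_FLUSH_AFTER_BYTES = 2 * 1024 * 1024   # "2MB" upper bound

default_pages = DEFAULT_FLUSH_AFTER_BYTES // BLCKSZ  # 32 pages
max_pages = MAX_FLUSH_AFTER_BYTES // BLCKSZ          # 256 pages
```

The 32-page figure is the same "context size" that reappears later in the thread when Fabien estimates worst-case unflushed dirty data per tablespace.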
Re: [HACKERS] checkpointer continuous flushing - V18
Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than , but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256Kb on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line. (plus adjustments for the other gucs) Some suggestions: What about the maximum value? If the default is in pages, maybe you could state it and afterwards translate it in size. "The default is 64 pages on Linux (usually 256Kb)..." The text could say something about sequential writes performance because pages are sorted.., but that it is lost for large bases and/or short checkpoints ? -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote: > I'm not sure I've seen these performance... If you have hard evidence, > please feel free to share it. Man, are you intentionally trying to be hard to work with? To quote the email you responded to: > My current plan is to commit this with the current behaviour (as in this > week[end]), and then do some actual benchmarking on this specific > part. It's imo a relatively minor detail.
Re: [HACKERS] checkpointer continuous flushing - V18
[...] I had originally kept it with one context per tablespace after refactoring this, but found that it gave worse results in rate limited loads even over only two tablespaces. That's on SSDs though. Might just mean that a smaller context size is better on SSD, and it could still be better per table space. The number of pages still in writeback (i.e. for which sync_file_range has been issued, but which haven't finished running yet) at the end of the checkpoint matters for the latency hit incurred by the fsync()s from smgrsync(); at least by my measurement. I'm not sure I've seen these performance... If you have hard evidence, please feel free to share it. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-10 17:33:33 -0500, Robert Haas wrote: > On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund wrote: > > On 2016-02-21 09:49:53 +0530, Robert Haas wrote: > >> I think there might be a semantic distinction between these two terms. > >> Doesn't writeback mean writing pages to disk, and flushing mean making > >> sure that they are durably on disk? So for example when the Linux > >> kernel thinks there is too much dirty data, it initiates writeback, > >> not a flush; on the other hand, at transaction commit, we initiate a > >> flush, not writeback. > > > > I don't think terminology is sufficiently clear to make such a > > distinction. Take e.g. our FlushBuffer()... > > Well then we should clarify it! Trying that as we speak, err, write. How about: Whenever more than bgwriter_flush_after bytes have been written by the bgwriter, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than , but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. 0 disables controlled writeback. The default is 256Kb on Linux, 0 otherwise. This parameter can only be set in the postgresql.conf file or on the server command line. (plus adjustments for the other gucs)
Re: [HACKERS] checkpointer continuous flushing - V18
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund wrote: > On 2016-02-21 09:49:53 +0530, Robert Haas wrote: >> I think there might be a semantic distinction between these two terms. >> Doesn't writeback mean writing pages to disk, and flushing mean making >> sure that they are durably on disk? So for example when the Linux >> kernel thinks there is too much dirty data, it initiates writeback, >> not a flush; on the other hand, at transaction commit, we initiate a >> flush, not writeback. > > I don't think terminology is sufficiently clear to make such a > distinction. Take e.g. our FlushBuffer()... Well then we should clarify it! :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 09:49:53 +0530, Robert Haas wrote: > I think there might be a semantic distinction between these two terms. > Doesn't writeback mean writing pages to disk, and flushing mean making > sure that they are durably on disk? So for example when the Linux > kernel thinks there is too much dirty data, it initiates writeback, > not a flush; on the other hand, at transaction commit, we initiate a > flush, not writeback. I don't think terminology is sufficiently clear to make such a distinction. Take e.g. our FlushBuffer()...
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote: > > >>>Now I cannot see how having one context per table space would have a > >>>significant negative performance impact. > >> > >>The 'dirty data' etc. limits are global, not per block device. By having > >>several contexts with unflushed dirty data the total amount of dirty > >>data in the kernel increases. > > > >Possibly, but how much? Do you have experimental data to back up that > >this is really an issue? > > > >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB > >of dirty buffers to manage for 16 table spaces, I do not see that as a > >major issue for the kernel. We flush in those increments, that doesn't mean there's only that much dirty data. I regularly see one order of magnitude more being dirty. I had originally kept it with one context per tablespace after refactoring this, but found that it gave worse results in rate limited loads even over only two tablespaces. That's on SSDs though. > To complete the argument, the 4MB is just a worst case scenario, in reality > flushing the different context would be randomized over time, so the > frequency of flushing a context would be exactly the same in both cases > (shared or per table space context) if the checkpoints are the same size, > just that with shared table space each flushing potentially targets all > tablespace with a few pages, while with the other version each flushing > targets one table space only. The number of pages still in writeback (i.e. for which sync_file_range has been issued, but which haven't finished running yet) at the end of the checkpoint matters for the latency hit incurred by the fsync()s from smgrsync(); at least by my measurement. My current plan is to commit this with the current behaviour (as in this week[end]), and then do some actual benchmarking on this specific part. It's imo a relatively minor detail. 
Greetings, Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
Now I cannot see how having one context per table space would have a significant negative performance impact. The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases. Possibly, but how much? Do you have experimental data to back up that this is really an issue? We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of dirty buffers to manage for 16 table spaces, I do not see that as a major issue for the kernel. More thoughts about your theoretical argument: To complete the argument, the 4MB is just a worst case scenario, in reality flushing the different contexts would be randomized over time, so the frequency of flushing a context would be exactly the same in both cases (shared or per table space context) if the checkpoints are the same size, just that with shared table space each flushing potentially targets all tablespaces with a few pages, while with the other version each flushing targets one table space only. So my handwaving analysis is that the flow of dirty buffers is the same with both approaches, but for the shared version buffers are more equally distributed on table spaces, hence reducing sequential write effectiveness, and for the other the dirty buffers are grouped more clearly per table space, so it should get better sequential write performance. -- Fabien.
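Fabien's worst-case estimate above, written out: with one writeback context per tablespace, the unflushed-but-dirty data the checkpointer can accumulate is bounded by context size times number of tablespaces times the block size.

```python
# Worst-case unflushed dirty data with per-tablespace writeback contexts:
# each context can hold up to CONTEXT_SIZE pages before it is flushed.
BLCKSZ = 8192          # 8KB buffers
CONTEXT_SIZE = 32      # pages per writeback context (the "32" above)

def worst_case_dirty_bytes(n_tablespaces):
    """Upper bound on dirty data held back across all contexts, in bytes."""
    return CONTEXT_SIZE * n_tablespaces * BLCKSZ

assert worst_case_dirty_bytes(16) == 4 * 1024 * 1024  # the 4MB in the thread
```

As the email notes, this is a worst case: in practice the contexts fill and flush at staggered times, so the instantaneous total is normally lower.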
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres, Now I cannot see how having one context per table space would have a significant negative performance impact. The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases. Possibly, but how much? Do you have experimental data to back up that this is really an issue? We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of dirty buffers to manage for 16 table spaces, I do not see that as a major issue for the kernel. Thus you're more likely to see stalls by the kernel moving pages into writeback. I do not see the above data having a 30% negative impact on tps, given the quite small amount of data under discussion, and switching to random IOs costs so much that it must really be avoided. Without further experimental data, I still think that the one context per table space is the reasonable choice. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote: > Now I cannot see how having one context per table space would have a > significant negative performance impact. The 'dirty data' etc. limits are global, not per block device. By having several contexts with unflushed dirty data the total amount of dirty data in the kernel increases. Thus you're more likely to see stalls by the kernel moving pages into writeback. Andres
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres, (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps per second avg, stddev [ min q1 median d3 max ] <=300tps 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% (2) with 1 tablespace on 1 disk : 956.0 tps per second avg, stddev [ min q1 median d3 max ] <=300tps 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% Well, that's not a particularly meaningful workload. You increased the number of flushes to the same number of disks considerably. It is just a simple workload designed to emphasize the effect of having one context shared for all table spaces instead of one per tablespace, without rewriting the patch and without a large host with multiple disks. For a meaningful comparison you'd have to compare using one writeback context for N tablespaces on N separate disks/raids, and using N writeback contexts for the same. Sure, it would be better to do that, but that would require (1) rewriting the patch, which is a small work, and also (2) having access to a machine with a number of disks/raids, that I do NOT have available. What happens in the 16 tb workload is that much smaller flushes are performed on the 16 files written in parallel, so the tps performance is significantly degraded, despite the writes being sorted in each file. On one tb, all buffers flushed are in the same file, so flushes are much more effective. When the context is shared and checkpointer buffer writes are balanced against table spaces, then when the limit is reached the flushing gets few buffers per tablespace, so this limits sequential writes to few buffers, hence the performance degradation. So I can explain the performance degradation *because* the flush context is shared between the table spaces, which is a logical argument backed with experimental data, so it is better than handwaving. Given the available hardware, this is the best proof I can have that context should be per table space. 
Now I cannot see how having one context per table space would have a significant negative performance impact. So the logical conclusion for me is that without further experimental data it is better to have one context per table space. If you have hardware with plenty of disks available for testing, that would provide better data, obviously. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote: > > >>Random updates on 16 tables which total to 1.1GB of data, so this is in > >>buffer, no significant "read" traffic. > >> > >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps > >>per second avg, stddev [ min q1 median q3 max ] <=300tps > >>679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% > >> > >>(2) with 1 tablespace on 1 disk : 956.0 tps > >>per second avg, stddev [ min q1 median q3 max ] <=300tps > >>956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% > > > >Interesting. That doesn't reflect my own tests, even on rotating media, > >at all. I wonder if it's related to: > >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 > > > >If you use your 12.04 kernel, that'd not be fixed. Which might be a > >reason to do it as you suggest. > > > >Could you share the exact details of that workload? > > See attached scripts (sh to create the 16 tables in the default or 16 table > spaces, small sql bench script, stat computation script). > > The per-second stats were computed with: > > grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 > --limit=300 > > Host is 8 cpu 16 GB, 2 HDD in RAID 1. Well, that's not a particularly meaningful workload. You increased the number of flushes to the same number of disks considerably. For a meaningful comparison you'd have to compare using one writeback context for N tablespaces on N separate disks/raids, and using N writeback contexts for the same. Andres
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-07 09:41:51 -0800, Andres Freund wrote: > > Due to the difference in amount of RAM, each machine used different scales - > > the goal is to have small, ~50% RAM, >200% RAM sizes: > > > > 1) Xeon: 100, 400, 6000 > > 2) i5: 50, 200, 3000 > > > > The commits actually tested are > > > >cfafd8be (right before the first patch) > >7975c5e0 Allow the WAL writer to flush WAL at a reduced rate. > >db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ... > > Huh, now I'm a bit confused. These are the commits you tested? Those > aren't the ones doing sorting and flushing? To clarify: The reason we'd not expect to see much difference here is that the above commits really only have any effect above noise if you use synchronous_commit=off. Without async commit it's just one additional gettimeofday() call and a few additional branches in the wal writer every wal_writer_delay. Andres
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote: > 1) HP DL380 G5 (old rack server) > - 2x Xeon E5450, 16GB RAM (8 cores) > - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC) > - RedHat 6 > - shared_buffers = 4GB > - min_wal_size = 2GB > - max_wal_size = 6GB > > 2) workstation with i5 CPU > - 1x i5-2500k, 8GB RAM > - 6x Intel S3700 100GB (in RAID0 for this benchmark) > - Gentoo > - shared_buffers = 2GB > - min_wal_size = 1GB > - max_wal_size = 8GB Thinking about it, with that hardware I'm not surprised if you're only seeing small benefits. The amount of RAM limits the amount of dirty data, and you have plenty of on-storage buffering in comparison to that. > Both machines were using the same kernel version 4.4.2 and default io > scheduler (cfq). > > The test procedure was quite simple - pgbench with three different scales, > for each scale three runs, 1h per run (and 30 minutes of warmup before each > run). > > Due to the difference in amount of RAM, each machine used different scales - > the goal is to have small, ~50% RAM, >200% RAM sizes: > > 1) Xeon: 100, 400, 6000 > 2) i5: 50, 200, 3000 > > The commits actually tested are > >cfafd8be (right before the first patch) >7975c5e0 Allow the WAL writer to flush WAL at a reduced rate. >db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ... Huh, now I'm a bit confused. These are the commits you tested? Those aren't the ones doing sorting and flushing? > Also, I really wonder what will happen with non-default io schedulers. I > believe all the testing so far was done with cfq, so what happens on > machines that use e.g. "deadline" (as many DB machines actually do)? deadline and noop showed slightly bigger benefits in my testing. Greetings, Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Tomas, One of the goals of this thread (as I understand it) was to make the overall behavior smoother - eliminate sudden drops in transaction rate due to bursts of random I/O etc. One way to look at this is in terms of how much the tps fluctuates, so let's see some charts. I've collected per-second tps measurements (using the aggregation built into pgbench) but looking at that directly is pretty pointless because it's very difficult to compare two noisy lines jumping up and down. So instead let's see a CDF of the per-second tps measurements. I.e. we have 3600 tps measurements, and given a tps value the question is what percentage of the measurements is below this value. y = Probability(tps <= x) We prefer higher values, and the ideal behavior would be that we get exactly the same tps every second. Thus an ideal CDF line would be a step line. Of course, that's rarely the case in practice. But comparing two CDF curves is easy - the line more to the right is better, at least for tps measurements, where we prefer higher values. Very nice and interesting graphs! Alas they are not easy to interpret for the HDD, as there are better/worse variations all along the distribution and the lines cross one another, so how it fares overall is unclear. Maybe a simple indication would be to compute the standard deviation of the per-second tps? The median may be interesting as well. I do have some more data, but those are the most interesting charts. The rest usually shows about the same thing (or nothing). Overall, I'm not quite sure the patches actually achieve the intended goals. On the 10k SAS drives I got better performance, but apparently much more variable behavior. On SSDs, I get a bit worse results. Indeed. -- Fabien.
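For readers who want to reproduce the comparison, the CDF Tomas describes is cheap to recompute from pgbench's per-second aggregates. A minimal, illustrative sketch (the tps traces below are made up for illustration, not data from these runs):

```python
def tps_cdf(samples):
    """Empirical CDF of per-second tps samples: sorted
    (tps, fraction of seconds at or below that tps) pairs."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def percentile(samples, p):
    """Smallest tps value whose CDF fraction reaches p."""
    for x, frac in tps_cdf(samples):
        if frac >= p:
            return x

# Two made-up per-second tps traces: a bursty one (checkpoint
# stalls mixed with fast seconds) and a smoother one.
bursty = [0, 50, 900, 1000, 50, 950, 0, 1000, 900, 950]
smooth = [580, 600, 610, 590, 605, 595, 600, 615, 585, 620]

# "The line more to the right is better": in its worst 20% of
# seconds the smooth run still sustains hundreds of tps, while
# the bursty run is stalled at 0 tps.
assert percentile(bursty, 0.2) == 0
assert percentile(smooth, 0.2) == 585
```

Plotting the (tps, fraction) pairs of two runs on the same axes gives exactly the curves discussed in this message.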
Re: [HACKERS] checkpointer continuous flushing - V18
Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic. (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps per second avg, stddev [ min q1 median q3 max ] <=300tps 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% (2) with 1 tablespace on 1 disk : 956.0 tps per second avg, stddev [ min q1 median q3 max ] <=300tps 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest. Could you share the exact details of that workload? See attached scripts (sh to create the 16 tables in the default or 16 table spaces, small sql bench script, stat computation script). The per-second stats were computed with: grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300 Host is 8 cpu 16 GB, 2 HDD in RAID 1. -- Fabien.

ts_create.sh Description: Bourne shell script
ts_test.sql Description: application/sql
avg.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# $Id: avg.py 1242 2016-02-06 14:44:02Z coelho $

import argparse
ap = argparse.ArgumentParser(description='show stats about data: count average stddev [min q1 median q3 max]...')
ap.add_argument('--median', default=True, action='store_true', help='compute median and quartile values')
ap.add_argument('--no-median', dest='median', default=True, action='store_false', help='do not compute median and quartile values')
ap.add_argument('--more', default=False, action='store_true', help='show some more stats')
ap.add_argument('--limit', type=float, default=None, help='set limit for counting below limit values')
ap.add_argument('--length', type=int, default=None, help='set expected length, assume 0 if beyond')
ap.add_argument('--precision', type=int, default=1, help='floating point precision')
ap.add_argument('file', nargs='*', help='list of files to process')
opt = ap.parse_args()

# option consistency
if opt.limit != None:
    opt.more = True
if opt.more:
    opt.median = True

# reset arguments for fileinput
import sys
sys.argv[1:] = opt.file

import fileinput

n, skipped, vals = 0, 0, []
k, vmin, vmax = None, None, None
sum1, sum2 = 0.0, 0.0

for line in fileinput.input():
    try:
        v = float(line)
        if opt.median:
            # keep track only if needed
            vals.append(v)
        if k is None:
            # first time
            k, vmin, vmax = v, v, v
        else:
            # next time
            vmin = min(vmin, v)
            vmax = max(vmax, v)
        n += 1
        vmk = v - k
        sum1 += vmk
        sum2 += vmk * vmk
    except ValueError:
        # float conversion failed
        skipped += 1

if n == 0:
    # avoid ops on None below
    k, vmin, vmax = 0.0, 0.0, 0.0

if opt.length:
    assert n > 0, "some data seen"
    missing = int(opt.length) - len(vals)
    assert missing >= 0, "positive number of missing data"
    if missing > 0:
        print("warning: %d missing data, expanding with zeros" % missing)
        if opt.median:
            vals += [ 0.0 ] * missing
        vmin = min(vmin, 0.0)
        sum1 += - k * missing
        sum2 += k * k * missing
        n += missing
    assert len(vals) == int(opt.length)

if opt.median:
    assert len(vals) == n, "consistent length"

# five numbers...
# numpy.percentile requires numpy at least 1.9 to use 'midpoint'
# statistics.median requires python 3.4 (?)
def median(vals, start, length):
    if len(vals) == 1:
        start, length = 0, 1
    m, odd = divmod(length, 2)
    #return 0.5 * (vals[start + m + odd - 1] + vals[start + m])
    return vals[start + m] if odd else \
        0.5 * (vals[start + m-1] + vals[start + m])

# return ratio of below limit (limit included) values
def below(vals, limit):
    # hmmm... short but generates a list
    #return float(len([ v for v in vals if v <= limit ])) / len(vals)
    below_limit = 0
    for v in vals:
        if v <= limit:
            below_limit += 1
    return float(below_limit) / len(vals)

# float prettyprint with precision
def f(v):
    return ('%.' + str(opt.precision) + 'f') % v

# output
if skipped:
    print("warning: %d lines skipped" % skipped)

if n > 0:
    # show result (hmmm, precision is truncated...)
    from math import sqrt
    avg, stddev = k + sum1 / n, sqrt((sum2 - (sum1 * sum1) / n) / n)
    if opt.median:
        vals.sort()
        med = median(vals, 0, len(vals))
        # not sure about odd/even issues here... q3 needs fixing if len is 1
        q1 = median(vals, 0, len(vals) // 2)
        q3 = median(vals, (len(vals)+1) // 2, len(vals) // 2)
        # build summary message
        msg = "avg over %d: %s ± %s [%s, %s, %s, %s, %s]" % \
            (n, f(avg), f(stddev), f(vmin), f(q1), f(med), f(q3), f(vmax))
        if opt.more:
            limit = opt.limit if opt.limit != None else 0.1 * med
            # msg += " <=%s:" % f(limit)
            msg += " %s%%" % f(100.0 * below(vals, limit))
    else:
        msg = "avg over %d: %s ± %s [%s, %s]" % \
            (n, f(avg), f(stddev), f(vmin), f(vmax))
else:
    msg = "no data"
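As a sanity check, the one-pass shifted formulas used in avg.py above (avg = k + sum1/n, stddev = sqrt((sum2 - sum1²/n)/n), with the first value as shift k) can be cross-checked against Python's statistics module; note that avg.py computes the population standard deviation, so statistics.pstdev (not statistics.stdev) is the matching function. The sample values here are arbitrary:

```python
import statistics
from math import sqrt

def shifted_stats(data):
    """One-pass mean/stddev with the first value as shift k, as avg.py does."""
    k = data[0]
    n = len(data)
    sum1 = sum(v - k for v in data)
    sum2 = sum((v - k) ** 2 for v in data)
    avg = k + sum1 / n
    stddev = sqrt((sum2 - sum1 * sum1 / n) / n)
    return avg, stddev

data = [679.6, 750.4, 317.0, 371.0, 438.5]
avg, stddev = shifted_stats(data)
assert abs(avg - statistics.mean(data)) < 1e-9
assert abs(stddev - statistics.pstdev(data)) < 1e-9
```

Shifting by the first value keeps sum2 small, which avoids the catastrophic cancellation the naive sum-of-squares formula suffers when the mean is large relative to the spread.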
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 11:05:20 -0500, Tom Lane wrote: > Andres Freund writes: > > Interesting. That doesn't reflect my own tests, even on rotating media, > > at all. I wonder if it's related to: > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 > > > If you use your 12.04 kernel, that'd not be fixed. Which might be a > > reason to do it as you suggest. > > Hmm ... that kernel commit is less than 4 months old. Would it be > reflected in *any* production kernels yet? Probably not - so far I thought it mainly had some performance benefits on relatively extreme workloads, where without the patch flushing still is better performance-wise than not flushing. But in the scenario Fabien has brought up it seems quite possible that sync_file_range emitting "storage cache flush" instructions could explain the rather large performance difference between his and my experiments. Regards, Andres
Re: [HACKERS] checkpointer continuous flushing - V18
Andres Freund writes: > Interesting. That doesn't reflect my own tests, even on rotating media, > at all. I wonder if it's related to: > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 > If you use your 12.04 kernel, that'd not be fixed. Which might be a > reason to do it as you suggest. Hmm ... that kernel commit is less than 4 months old. Would it be reflected in *any* production kernels yet? regards, tom lane
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote: > > >I did a quick & small test with random updates on 16 tables with > >checkpoint_flush_after=16 checkpoint_timeout=30 > > Another run with more "normal" settings and over 1000 seconds, so less > "quick & small" than the previous one. > > checkpoint_flush_after = 16 > checkpoint_timeout = 5min # default > shared_buffers = 2GB # 1/8 of available memory > > Random updates on 16 tables which total to 1.1GB of data, so this is in > buffer, no significant "read" traffic. > > (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps > per second avg, stddev [ min q1 median q3 max ] <=300tps > 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% > > (2) with 1 tablespace on 1 disk : 956.0 tps > per second avg, stddev [ min q1 median q3 max ] <=300tps > 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% Interesting. That doesn't reflect my own tests, even on rotating media, at all. I wonder if it's related to: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5 If you use your 12.04 kernel, that'd not be fixed. Which might be a reason to do it as you suggest. Could you share the exact details of that workload? Greetings, Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V18
I did a quick & small test with random updates on 16 tables with checkpoint_flush_after=16 checkpoint_timeout=30 Another run with more "normal" settings and over 1000 seconds, so less "quick & small" than the previous one. checkpoint_flush_after = 16 checkpoint_timeout = 5min # default shared_buffers = 2GB # 1/8 of available memory Random updates on 16 tables which total to 1.1GB of data, so this is in buffer, no significant "read" traffic. (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps per second avg, stddev [ min q1 median q3 max ] <=300tps 679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5% (2) with 1 tablespace on 1 disk : 956.0 tps per second avg, stddev [ min q1 median q3 max ] <=300tps 956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1% -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres, AFAICR I used a "flush context" for each table space in some version I submitted, because I do think that this whole writeback logic really does make sense *per table space*, which suggests that there should be as many writeback contexts as table spaces, otherwise the positive effect may be totally lost when many table spaces are used. Any thoughts? Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I did a quick & small test with random updates on 16 tables with checkpoint_flush_after=16 checkpoint_timeout=30 (1) with 16 tablespaces (1 per table, but same disk) : tps = 1100, 27% time under 100 tps (2) with 1 tablespace : tps = 1200, 3% time under 100 tps This result is logical: with one writeback context shared between tablespaces the sync_file_range is issued on a few buffers per file at a time on the 16 files, no coalescing occurs there, so this results in random IOs, while with one table space all writes are aggregated per file. ISTM that this quick test shows that writeback contexts are relevant per tablespace, as I expected. -- Fabien.
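The coalescing argument can be made concrete with a toy simulation (hypothetical parameters and a deliberate simplification, not the patch's actual code): a writeback context flushes once it accumulates flush_after pending buffers, and the sequential-write benefit is proxied by how many contiguous runs each flush decomposes into. Interleaving 4 tablespaces through one shared context yields many short runs; one context per tablespace yields long ones:

```python
def flush_runs(blocks):
    """Number of contiguous runs in a set of (file, blockno) pages;
    each run could be coalesced into one sequential write."""
    runs, prev = 0, None
    for f, b in sorted(blocks):
        if prev != (f, b - 1):
            runs += 1
        prev = (f, b)
    return runs

# Sorted checkpoint writes balanced round-robin across 4 tablespaces
# (files 0..3), 32 sequential blocks per file, as the checkpointer does.
writes = [(ts, blk) for blk in range(32) for ts in range(4)]

flush_after = 16  # hypothetical flush threshold, in buffers

def count_runs(writes, contexts):
    """Total runs issued when writes go through `contexts` writeback
    contexts (tablespace mapped by modulo), flushing every flush_after."""
    pending = {c: [] for c in range(contexts)}
    total = 0
    for ts, blk in writes:
        ctx = ts % contexts
        pending[ctx].append((ts, blk))
        if len(pending[ctx]) >= flush_after:
            total += flush_runs(pending[ctx])
            pending[ctx] = []
    for p in pending.values():
        if p:
            total += flush_runs(p)
    return total

shared = count_runs(writes, 1)   # one shared writeback context
per_ts = count_runs(writes, 4)   # one context per tablespace
assert per_ts < shared           # per-tablespace contexts coalesce better
```

With these made-up numbers the shared context issues 32 short runs versus 8 long ones for per-tablespace contexts, which is the "few buffers per file at a time" effect described above. On a single disk the difference shows up as random vs sequential IO, exactly as in the measured tps figures.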
Re: [HACKERS] checkpointer continuous flushing - V18
ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed. They don't cause much space usage, and we access the values frequently. So why not store them? The same question works the other way around: these values are one division away, so why not compute them when needed? No big deal. [...] Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. I do agree. I'm suggesting that a comment should be added to justify why float8 accuracy is okay. I see a binary_heap_allocate but no corresponding deallocation, this looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... I'm not sure that a memory context is justified here, as there are only two mallocs and the checkpointer works for very long times. I think that it is simpler to just get the malloc/free right. [...] I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available... Ok with that. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
[...] I do think that this whole writeback logic really does make sense *per table space*, Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what is the point?), so you should want to coalesce and "writeback" pages per device as well. Calls to sync_file_range on distinct devices should probably be issued more or less randomly, and should not interfere with one another. The kernel's dirty buffer accounting is global, not per block device. Sure, but this is not my point. My point is that "sync_file_range" moves buffers to the device IO queues, which are per device. If there is one queue in pg and many queues on many devices, the whole point of coalescing to get sequential writes is somehow lost. It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks. Ok, some people would not benefit if they use many tablespaces on one device; too bad, but that does not look like a very useful setting anyway, and I do not think it would do much harm in this case. If you use just one context, the more table spaces the smaller the performance gains, because there is less and less aggregation and thus fewer sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct. 
But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. ISTM that you would need a significant number of tablespaces to see the benefit. If you do not do that, the more table spaces the more random the IOs, which is disappointing. Also, "the cost is marginal", so I do not see any good argument not to do it. -- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V18
Hi, On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote: > * CkptSortItem: > > I think that allocating 20 bytes per buffer in shared memory is a little on > the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes > to hold 4 values, could be one byte or even 2 bits somewhere. Also, there > are very few tablespaces, they could be given a small number and this number > could be used instead of the Oid, so the space requirement could be reduced > to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping > 4 byte alignment and also getting 8 byte alignment... If this is too > much, I have shown that it can work with only 4 bytes per buffer, as the > sorting is really just a performance optimisation and is not broken if some > stuff changes between sorting & writeback, but you did not like the idea. If > the amount of shared memory required is a significant concern, it could be > resurrected, though. This is less than 0.2 % of the memory related to shared buffers. We have the same amount of memory allocated in CheckpointerShmemSize(), and nobody has complained so far. And sorry, going back to the previous approach isn't going to fly, and I've no desire to discuss that *again*. > ISTM that "progress" and "progress_slice" only depend on num_scanned and > per-tablespace num_to_scan and total num_to_scan, so they are somehow > redundant and the progress could be recomputed from the initial figures > when needed. They don't cause much space usage, and we access the values frequently. So why not store them? > If these fields are kept, I think that a comment should justify why float8 > precision is okay for the purpose. I think it is quite certainly fine in the > worst case with 32 bits buffer_ids, but it would not be if this size is > changed someday. That seems pretty much unrelated to having the fields - the question of accuracy plays a role regardless, no? 
Given realistic amounts of memory the max potential "skew" seems fairly small with float8. If we ever flush one buffer "too much" for a tablespace it's pretty much harmless. > ISTM that nearly all of the collected data on the second sweep could be > collected on the first sweep, so that this second sweep could be avoided > altogether. The only missing data is the index of the first buffer in the > array, which can be computed by considering tablespaces only, sweeping over > buffers is not needed. That would suggest creating the heap or using a hash > in the initial buffer sweep to keep this information. This would also > provide a point where to number tablespaces for compressing the CkptSortItem > struct. Doesn't seem worth the complexity to me. > I'm wondering about calling CheckpointWriteDelay on each round, maybe > a minimum amount of write would make sense. Why? There's not really much benefit of doing more work than needed. I think we should sleep far shorter in many cases, but that's indeed a separate issue. > I see a binary_heap_allocate but no corresponding deallocation, this > looks like a memory leak... or is there some magic involved? Hm. I think we really should use a memory context for all of this - we could after all error out somewhere in the middle... > >I think this patch primarily needs: > >* Benchmarking on FreeBSD/OSX to see whether we should enable the > > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm > > inclined to leave it off till then. > > I do not have that. As "msync" seems available on Linux, it is possible to > force using it with a "ifdef 0" to skip sync_file_range and check whether it > does some good there. Unfortunately it doesn't work well on linux: * On many OSs msync() on a mmap'ed file triggers writeback. On linux * it only does so when MS_SYNC is specified, but then it does the * writeback synchronously. Luckily all common linux systems have * sync_file_range(). 
This is preferable over FADV_DONTNEED because * it doesn't flush out clean data. I've verified beforehand, with a simple demo program, that msync(MS_ASYNC) does something reasonable on freebsd... > Idem for the "posix_fadvise" stuff. I can try to do > that, but it takes time to do so, if someone can test on other OS it would > be much better. I think that if it works it should be kept in, so it is just > a matter of testing it. I'm not arguing for ripping it out, what I mean is that we don't set a nondefault value for the GUCs on platforms with just posix_fadvise available... Greetings, Andres Freund
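The progress/progress_slice bookkeeping debated above can be sketched with a small heap-based model (a simplification for illustration, not the patch's code): each tablespace's progress advances by a slice of total_to_scan / ts_to_scan per buffer written, and always writing from the tablespace with the least progress interleaves writes proportionally to each tablespace's share. This is where the float8 accuracy question arises, since the slice is generally not an integer:

```python
import heapq

def interleave(num_to_scan):
    """num_to_scan: dict of tablespace -> buffers to write.
    Returns tablespaces in a progress-balanced write order."""
    total = sum(num_to_scan.values())
    remaining = dict(num_to_scan)
    heap = [(0.0, ts) for ts in num_to_scan]   # (progress, tablespace)
    heapq.heapify(heap)
    order = []
    while heap:
        progress, ts = heapq.heappop(heap)
        order.append(ts)
        remaining[ts] -= 1
        if remaining[ts] > 0:
            # float8 progress; the accuracy discussion in the thread
            # is about accumulated rounding in exactly this addition
            heapq.heappush(heap, (progress + total / num_to_scan[ts], ts))
    return order

order = interleave({'a': 6, 'b': 3, 'c': 3})
# 'a' holds half the buffers, so it gets roughly every other write,
# proportional from the start rather than drained tablespace by tablespace
assert order.count('a') == 6 and order.count('b') == 3
assert order[:4].count('a') == 2
```

The per-tablespace balancing is what makes the shared-vs-per-tablespace writeback context question interesting: balanced writes mean each flush batch contains only a few buffers from each tablespace's file.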
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote: > >>In the discussion in the wal section, I'm not sure about the effect of > >>setting writebacks on SSD, [...] > > > >Yea, that paragraph needs some editing. I think we should basically > >remove that last sentence. > > Ok, fine with me. Does that mean that flushing has a significant positive > impact on SSD in your tests? Yes. The reason we need flushing is that the kernel amasses dirty pages, and then flushes them at once. That hurts for both SSDs and rotational media. Sorting is the bigger question, but I've seen it have clearly beneficial performance impacts. I guess if you look at devices with an internal block size bigger than 8k, you'd see even larger differences. > >>Maybe the merging strategy could be more aggressive than just strict > >>neighbors? > > > >I don't think so. If you flush more than neighbouring writes you'll > >often end up flushing buffers dirtied by another backend, causing > >additional stalls. > > Ok. Maybe the neighbor definition could be relaxed just a little bit so > that small holes are stepped over, but not large holes? If there are only a few > pages in between, even if written by another process, then writing them > together should be better? Well, this can wait for a clear case, because > hopefully the OS will recoalesce them behind the scenes anyway. I'm against doing so without clear measurements of a benefit. > >Also because the infrastructure is used for more than checkpoint > >writes. There's absolutely no ordering guarantees there. > > Yep, but not much benefit to expect from a few dozen random pages either. Actually, there's kinda frequently a benefit observable. Even if few requests can be merged, doing IO requests in an order more likely doable within a few rotations is beneficial. Also, the cost is marginal, so why worry? > >>[...] 
I do think that this whole writeback logic really does make > >>sense *per table space*, > > > >Leads to less regular IO, because if your tablespaces are evenly sized > >(somewhat common) you'll sometimes end up issuing sync_file_range's > >shortly after each other. For latency outside checkpoints it's > >important to control the total amount of dirty buffers, and that's > >obviously independent of tablespaces. > > I do not understand/buy this argument. > > The underlying IO queue is per device, and table spaces should be per device > as well (otherwise what is the point?), so you should want to coalesce and > "writeback" pages per device as well. Calling sync_file_range on distinct > devices should probably be issued more or less randomly, and should not > interfere with one another. The kernel's dirty buffer accounting is global, not per block device. It's also actually rather common to have multiple tablespaces on a single block device. Especially if SANs and such are involved; where you don't even know which partitions are on which disks. > If you use just one context, the more table spaces the less performance > gains, because there is less and less aggregation thus sequential writes per > device. > > So for me there should really be one context per tablespace. That would > suggest a hashtable or some other structure to keep and retrieve them, which > would not be that bad, and I think that it is what is needed. That'd be much easier to do by just keeping the context in the per-tablespace struct. But anyway, I'm really doubtful about going for that; I had it that way earlier, and observing IO showed it not being beneficial. > >>For the checkpointer, a key aspect is that the scheduling process goes > >>to sleep from time to time, and this sleep time looked like a great > >>opportunity to do this kind of flushing. You choose not to take advantage > >>of the behavior, why? 
> > > >Several reasons: Most importantly there's absolutely no guarantee that > >you'll ever end up sleeping, it's quite common for that to happen only seldom. > > Well, that would be under a situation when pg is completely unresponsive. > More so, this behavior *makes* pg unresponsive. No. The checkpointer being bottlenecked on actual IO performance doesn't impact production that badly. It'll just sometimes block in sync_file_range(), but the IO queues will have enough space to frequently give way to other backends, particularly to synchronous reads (most pg reads) and synchronous writes (fdatasync()). So a single checkpoint will take a bit longer, but otherwise the system will mostly keep up the work in a regular manner. Without the sync_file_range() calls the kernel will amass dirty buffers until global dirty limits are reached, which then will bring the whole system to a standstill. It's pretty common that checkpoint_timeout is too short to be able to write all shared_buffers out; in that case it's much better to slow down the whole checkpoint, instead of being incredibly slow at the end. > >I also don't really believe it helps that much, although that's a complex > >argument to make. > > Yep. My
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres, Here is a review for the second patch. For 0002 I've recently changed: * Removed the sort timing information, we've proven sufficiently that it doesn't take a lot of time. I put it there initially to demonstrate that there was no cache performance issue when sorting on just buffer indexes. As it is always small, I agree that it is not needed. Well, it could still be in seconds on a very large shared buffers setting with a very large checkpoint, but then the checkpoint would be tremendously huge... * Minor comment polishing. Patch applies and checks on Linux. * CkptSortItem: I think that allocating 20 bytes per buffer in shared memory is a little on the heavy side. Some compression can be achieved: sizeof(ForkNum) is 4 bytes to hold 4 values, could be one byte or even 2 bits somewhere. Also, there are very few tablespaces, they could be given a small number and this number could be used instead of the Oid, so the space requirement could be reduced to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping 4 byte alignment and also getting 8 byte alignment... If this is too much, I have shown that it can work with only 4 bytes per buffer, as the sorting is really just a performance optimisation and is not broken if some stuff changes between sorting & writeback, but you did not like the idea. If the amount of shared memory required is a significant concern, it could be resurrected, though. * CkptTsStatus: As I suggested in the other mail, I think that this structure should also keep a per tablespace WritebackContext so that coalescing is done per tablespace. ISTM that "progress" and "progress_slice" only depend on num_scanned and per-tablespace num_to_scan and total num_to_scan, so they are somehow redundant and the progress could be recomputed from the initial figures when needed. If these fields are kept, I think that a comment should justify why float8 precision is okay for the purpose. 
I think it is quite certainly fine in the worst case with 32-bit buffer_ids, but it would not be if this size is changed someday. * BufferSync: After a first sweep to collect buffers to write, they are sorted, and then those buffers are swept again to compute some per-tablespace data and organise a heap. ISTM that nearly all of the data collected on the second sweep could be collected on the first sweep, so that this second sweep could be avoided altogether. The only missing data is the index of the first buffer in the array, which can be computed by considering tablespaces only; sweeping over buffers is not needed. That would suggest creating the heap or using a hash in the initial buffer sweep to keep this information. This would also provide a point at which to number tablespaces for compressing the CkptSortItem struct. I'm wondering about calling CheckpointWriteDelay on each round; maybe a minimum amount of writing would make sense. This remark is independent of this patch. Probably it works fine because after a sleep the checkpointer is behind enough that it will write a bunch of buffers before sleeping again. I see a binary_heap_allocate but no corresponding deallocation, which looks like a memory leak... or is there some magic involved? There is some debug stuff in #ifdefs to remove. I think that the buffer/README should be updated with explanations about sorting in the checkpointer. I think this patch primarily needs: * Benchmarking on FreeBSD/OSX to see whether we should enable the mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm inclined to leave it off till then. I do not have that. As "msync" seems available on Linux, it is possible to force using it with an "#if 0" to skip sync_file_range and check whether it does some good there. Idem for the "posix_fadvise" stuff. I can try to do that, but it takes time to do so; if someone can test on other OSes it would be much better. 
I think that if it works it should be kept in, so it is just a matter of testing it. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres, [...] I do think that this whole writeback logic really does make sense *per table space*, Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calls to sync_file_range on distinct devices would be issued more or less randomly and should not interfere with one another. If you use just one context, the more table spaces the less performance gained, because there is less and less aggregation and thus fewer sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed. Note: I think that an easy way to do that in the "checkpoint sort" patch is simply to keep a WritebackContext in the CkptTsStatus structure, which is per table space in the checkpointer. For bgwriter & backends it can wait; there is little "writeback" coalescing because IO should be pretty random, so it does not matter much. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hallo Andres, In some previous version I think a warning was shown if the feature was requested but not available. I think we should either silently ignore it, or error out. Warnings somewhere in the background aren't particularly meaningful. I like "ignoring with a warning" in the log file, because when things do not behave as expected that is where I'll be looking. I do not think that it should error out. The sgml documentation about the "*_flush_after" configuration parameters talks about bytes, but the actual unit should be buffers. The unit actually is buffers, but you can configure it using bytes. We've done the same for other GUCs (shared_buffers, wal_buffers, ...). Referring to bytes is easier because you don't have to explain that it depends on compilation settings how much data it actually is and such. So I understand that it works with kb as well. Now I do not think that it would need a lot of explanation if you say that it is a number of pages, and I think that a number of pages is significant because it is a number of IO requests to be coalesced, eventually. In the discussion in the wal section, I'm not sure about the effect of setting writebacks on SSD, [...] Yea, that paragraph needs some editing. I think we should basically remove that last sentence. Ok, fine with me. Does that mean that flushing has a significant positive impact on SSD in your tests? However it does not address the point that bgwriter and backends basically issue random writes, [...] The benefit is primarily that you don't collect large amounts of dirty buffers in the kernel page cache. In most cases the kernel will not be able to coalesce these writes either... I've measured *massive* performance latency differences for workloads that are bigger than shared buffers - because suddenly bgwriter / backends do the majority of the writes. Flushing in the checkpoint quite possibly makes nearly no difference in such cases. So I understand that there is a positive impact under some load. Good! 
Maybe the merging strategy could be more aggressive than just strict neighbors? I don't think so. If you flush more than neighbouring writes you'll often end up flushing buffers dirtied by another backend, causing additional stalls. Ok. Maybe the neighbor definition could be relaxed just a little bit so that small holes are stepped over, but not large holes? If there are only a few pages in between, even if written by another process, then writing them together should be better? Well, this can wait for a clear case, because hopefully the OS will recoalesce them behind anyway. struct WritebackContext: keeping a pointer to guc variables is a kind of trick, I think it deserves a comment. It has, it's just in WritebackContextInit(). Can duplicate it. I missed it, I expected something in the struct definition. Do not duplicate, but cross-reference it? IssuePendingWritebacks: I understand that qsort is needed "again" because when balancing writes over tablespaces they may be intermixed. Also because the infrastructure is used for more than checkpoint writes. There's absolutely no ordering guarantees there. Yep, but not much benefit to expect from a few dozen random pages either. [...] I do think that this whole writeback logic really does make sense *per table space*, Leads to less regular IO, because if your tablespaces are evenly sized (somewhat common) you'll sometimes end up issuing sync_file_range's shortly after each other. For latency outside checkpoints it's important to control the total amount of dirty buffers, and that's obviously independent of tablespaces. I do not understand/buy this argument. The underlying IO queue is per device, and table spaces should be per device as well (otherwise what's the point?), so you should want to coalesce and "writeback" pages per device as well. Calls to sync_file_range on distinct devices would be issued more or less randomly and should not interfere with one another. 
If you use just one context, the more table spaces the less performance gained, because there is less and less aggregation and thus fewer sequential writes per device. So for me there should really be one context per tablespace. That would suggest a hashtable or some other structure to keep and retrieve them, which would not be that bad, and I think that it is what is needed. For the checkpointer, a key aspect is that the scheduling process goes to sleep from time to time, and this sleep time looked like a great opportunity to do this kind of flushing. You chose not to take advantage of that behavior, why? Several reasons: Most importantly there's absolutely no guarantee that you'll ever end up sleeping, it's quite common for that to happen only seldom. Well, that would be in a situation where pg is completely unresponsive. Moreover, this behavior *makes* pg unresponsive. If you're bottlenecked on IO, you can end up being behind all the time. Hopefully
Re: [HACKERS] checkpointer continuous flushing - V18
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund wrote: >> The documentation seems to use "flush" but the code talks about "writeback" >> or "flush", depending. I think one vocabulary, whichever it is, should be >> chosen and everything should stick to it, otherwise everything looks kind of >> fuzzy and raises doubt for the reader (is it the same thing? is it something >> else?). I initially used "flush", but it seems a bad idea because it has >> nothing to do with the flush function, so I'm fine with writeback or anything >> else, I just think that *one* word should be chosen and used everywhere. > > Hm. I think there might be a semantic distinction between these two terms. Doesn't writeback mean writing pages to disk, and flushing mean making sure that they are durably on disk? So for example when the Linux kernel thinks there is too much dirty data, it initiates writeback, not a flush; on the other hand, at transaction commit, we initiate a flush, not writeback. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hi, On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote: > >* Currently *_flush_after can be set to a nonzero value, even if there's > > no support for flushing on that platform. Imo that's ok, but perhaps > > other people's opinion differ. > > In some previous version I think a warning was shown if the feature was > requested but not available. I think we should either silently ignore it, or error out. Warnings somewhere in the background aren't particularly meaningful. > Here are some quick comments on the patch: > > Patch applies cleanly on head. Compiled and checked on Linux. Compilation > issues on other systems, see below. For those I've already pushed a small fixup commit to git... Stupid mistake. > The documentation seems to use "flush" but the code talks about "writeback" > or "flush", depending. I think one vocabulary, whichever it is, should be > chosen and everything should stick to it, otherwise everything looks kind of > fuzzy and raises doubt for the reader (is it the same thing? is it something > else?). I initially used "flush", but it seems a bad idea because it has > nothing to do with the flush function, so I'm fine with writeback or anything > else, I just think that *one* word should be chosen and used everywhere. Hm. > The sgml documentation about "*_flush_after" configuration parameter talks > about bytes, but the actual unit should be buffers. The unit actually is buffers, but you can configure it using bytes. We've done the same for other GUCs (shared_buffers, wal_buffers, ...). Referring to bytes is easier because you don't have to explain that it depends on compilation settings how much data it actually is and such. > Also, the maximum value (128 ?) should appear in the text. Right. 
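As an illustration of the bytes-vs-buffers point (assuming the default 8kB block size; the values are examples, not recommendations), the same setting can be written either way in postgresql.conf:

```ini
# assuming the default BLCKSZ of 8kB, these two are equivalent:
checkpoint_flush_after = 32       # expressed in buffers
checkpoint_flush_after = 256kB    # 256kB / 8kB pages = 32 buffers
```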
> In the discussion in the wal section, I'm not sure about the effect of > setting writebacks on SSD, but I think that you have made some tests > so maybe you have an answer and the corresponding section could be > written with some more definitive text than "probably brings no > benefit on SSD". Yea, that paragraph needs some editing. I think we should basically remove that last sentence. > A good point of the whole approach is that it is available to all kinds > of pg processes. Exactly. > However it does not address the point that bgwriter and > backends basically issue random writes, so I would not expect much positive > effect before these writes are somehow sorted, which means doing some > compromise in the LRU/LFU logic... The benefit is primarily that you don't collect large amounts of dirty buffers in the kernel page cache. In most cases the kernel will not be able to coalesce these writes either... I've measured *massive* performance latency differences for workloads that are bigger than shared buffers - because suddenly bgwriter / backends do the majority of the writes. Flushing in the checkpoint quite possibly makes nearly no difference in such cases. > well, all this is best kept for later, and I'm fine to have the > flushing logic there. I'm wondering why you chose 16 & 64 as defaults > for backends & bgwriter, though. I chose a small value for backends because there often are a large number of backends, and thus the amount of dirty data of each adds up. I used a larger value for bgwriter because I saw that ending up using bigger IOs. > IssuePendingWritebacks: you merge only strictly neighboring writes. > Maybe the merging strategy could be more aggressive than just strict > neighbors? I don't think so. If you flush more than neighbouring writes you'll often end up flushing buffers dirtied by another backend, causing additional stalls. 
And if the writes aren't actually neighbouring there's not much gained from issuing them in one sync_file_range call. > mdwriteback: all variables could be declared within the while, I do not > understand why some are in and some are out. Right. > ISTM that putting writeback management at the relation level does not > help a lot, because you have to translate again from relation to > files. Sure, but what's the problem with that? That's how normal read/write IO works as well? > struct WritebackContext: keeping a pointer to guc variables is a kind of > trick, I think it deserves a comment. It has, it's just in WritebackContextInit(). Can duplicate it. > ScheduleBufferTagForWriteback: the "pending" variable is not very > useful. Shortens line length a good bit, at no cost. > IssuePendingWritebacks: I understand that qsort is needed "again" > because when balancing writes over tablespaces they may be intermixed. Also because the infrastructure is used for more than checkpoint writes. There's absolutely no ordering guarantees there. > AFAICR I used a "flush context" for each table space in some version > I submitted, because I do think that this whole writeback logic really > does make sense *per table space*, which suggests that there should be as > many writeback contexts as table spaces, otherwise the
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres, For 0001 I've recently changed: * Don't schedule writeback after smgrextend() - that defeats the linux delayed allocation mechanism, increasing fragmentation noticeably. * Add docs for the new GUC variables * comment polishing * BackendWritebackContext now isn't dynamically allocated anymore I think this patch primarily needs: * review of the docs, not sure if they're easy enough to understand. Some language polishing might also be needed. Yep, see below. * review of the writeback API, combined with the smgr/md.c changes. See various comments below. * Currently *_flush_after can be set to a nonzero value, even if there's no support for flushing on that platform. Imo that's ok, but perhaps other people's opinions differ. In some previous version I think a warning was shown if the feature was requested but not available. Here are some quick comments on the patch: Patch applies cleanly on head. Compiled and checked on Linux. Compilation issues on other systems, see below. When pages are written by a process (checkpointer, bgwriter, backend worker), the list of recently written pages is kept and every so often an advisory fsync (sync_file_range, other options for other systems) is issued so that the data is sent to the IO system without relying on more or less (un)controllable OS policy. The documentation seems to use "flush" but the code talks about "writeback" or "flush", depending. I think one vocabulary, whichever it is, should be chosen and everything should stick to it, otherwise everything looks kind of fuzzy and raises doubt for the reader (is it the same thing? is it something else?). I initially used "flush", but it seems a bad idea because it has nothing to do with the flush function, so I'm fine with writeback or anything else, I just think that *one* word should be chosen and used everywhere. The sgml documentation about the "*_flush_after" configuration parameters talks about bytes, but the actual unit should be buffers. 
I think that keeping a number of buffers should be fine, because that is what the internal stuff will manage, not bytes. Also, the maximum value (128 ?) should appear in the text. In the discussion in the wal section, I'm not sure about the effect of setting writebacks on SSD, but I think that you have made some tests so maybe you have an answer and the corresponding section could be written with some more definitive text than "probably brings no benefit on SSD". A good point of the whole approach is that it is available to all kinds of pg processes. However it does not address the point that bgwriter and backends basically issue random writes, so I would not expect much positive effect before these writes are somehow sorted, which means doing some compromise in the LRU/LFU logic... well, all this is best kept for later, and I'm fine to have the flushing logic there. I'm wondering why you chose 16 & 64 as defaults for backends & bgwriter, though. IssuePendingWritebacks: you merge only strictly neighboring writes. Maybe the merging strategy could be more aggressive than just strict neighbors? mdwriteback: all variables could be declared within the while, I do not understand why some are in and some are out. ISTM that putting writeback management at the relation level does not help a lot, because you have to translate again from relation to files. The good news is that it should work as well, and that it does avoid the issue that the file may have been closed in between, so why not. The PendingWriteback struct looks useless. I think it should be removed, and maybe put back one day if it is needed, which I rather doubt. struct WritebackContext: keeping a pointer to guc variables is a kind of trick, I think it deserves a comment. ScheduleBufferTagForWriteback: the "pending" variable is not very useful. Maybe consider shortening the "pending_writebacks" field name to "writebacks"? 
IssuePendingWritebacks: I understand that qsort is needed "again" because when balancing writes over tablespaces they may be intermixed. AFAICR I used a "flush context" for each table space in some version I submitted, because I do think that this whole writeback logic really does make sense *per table space*, which suggests that there should be as many writeback contexts as table spaces, otherwise the positive effect may be totally lost if table spaces are used. Any thoughts? Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always true, I think, as it is already checked in the initialization and when setting gucs. SyncOneBuffer: I wonder why you copy the tag after releasing the lock. I guess it is okay because it is still pinned. pg_flush_data: in the first #elif, "context" is undeclared line 446. Label "out" is not defined line 455. In the second #elif, "context" is undeclared line 490 and label "out" line 500 is not defined either. For the checkpointer, a key aspect is that the scheduling process goes to sleep from time to
Re: [HACKERS] checkpointer continuous flushing - V16
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO wrote: >> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO >> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge. >> >> >> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu > > > Interesting! To summarize it, 25% performance degradation from best kernel > (2.6.32) to worst (3.2.0), that is indeed significant. As far as I recall, the OS cache eviction is very aggressive in 3.2, so it would be possible that data from the FS cache that was just read could be evicted even if it had not been used yet. This represents a large difference when the database does not fit in RAM. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote: > > Hello Andres, > > >Here's the next two (the most important) patches of the series: > >0001: Allow to trigger kernel writeback after a configurable number of > >writes. > >0002: Checkpoint sorting and balancing. > > I will look into these two in depth. > > Note that I would have ordered them in reverse because sorting is nearly > always very beneficial, and "writeback" (formerly called flushing) is then > nearly always very beneficial on sorted buffers. I had it that way earlier. I actually saw pretty large regressions from sorting alone in some cases as well, apparently because the kernel submits much larger IOs to disk; although that probably only shows on SSDs. This way the modifications imo look a trifle better ;). I'm intending to commit both at the same time, keeping them separate only because they're easier to understand separately. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
Hello Andres, Here's the next two (the most important) patches of the series: 0001: Allow to trigger kernel writeback after a configurable number of writes. 0002: Checkpoint sorting and balancing. I will look into these two in depth. Note that I would have ordered them in reverse because sorting is nearly always very beneficial, and "writeback" (formerly called flushing) is then nearly always very beneficial on sorted buffers. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V18
On 2016-02-04 16:54:58 +0100, Andres Freund wrote: > Hi, > > Fabien asked me to post a new version of the checkpoint flushing patch > series. While this isn't entirely ready for commit, I think we're > getting closer. > > I don't want to post a full series right now, but my working state is > available on > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush > git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush I've updated the git tree. Here's the next two (the most important) patches of the series: 0001: Allow to trigger kernel writeback after a configurable number of writes. 0002: Checkpoint sorting and balancing. For 0001 I've recently changed: * Don't schedule writeback after smgrextend() - that defeats the linux delayed allocation mechanism, increasing fragmentation noticeably. * Add docs for the new GUC variables * comment polishing * BackendWritebackContext now isn't dynamically allocated anymore I think this patch primarily needs: * review of the docs, not sure if they're easy enough to understand. Some language polishing might also be needed. * review of the writeback API, combined with the smgr/md.c changes. * Currently *_flush_after can be set to a nonzero value, even if there's no support for flushing on that platform. Imo that's ok, but perhaps other people's opinions differ. For 0002 I've recently changed: * Removed the sort timing information, we've proven sufficiently that it doesn't take a lot of time. * Minor comment polishing. I think this patch primarily needs: * Benchmarking on FreeBSD/OSX to see whether we should enable the mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm inclined to leave it off till then. Regards, Andres >From 58aee659417372f3dda4420d8f2a4f4d41c56d31 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 19 Feb 2016 12:13:05 -0800 Subject: [PATCH 1/4] Allow to trigger kernel writeback after a configurable number of writes. 
Currently writes to the main data files of postgres all go through the OS page cache. This means that currently several operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed. On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable. Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs. With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that are good are separate, and because the performance considerations of controlled flushing for each of these are different. 
A later patch will add checkpoint sorting - after that, flushes from the checkpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which is slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so does sorting without flush control. TODO: * verify msync codepath * properly detect mmap() && msync(MS_ASYNC) support, use it by default if available and sync_file_range is *not* available Discussion: alpine.DEB.2.10.150601132.28433@sto Author: Fabien Coelho and Andres Freund --- doc/src/sgml/config.sgml | 81 +++ doc/src/sgml/wal.sgml | 13 +++ src/backend/postmaster/bgwriter.c | 8 +- src/backend/storage/buffer/buf_init.c | 5 +
Re: [HACKERS] checkpointer continuous flushing - V16
Hallo Patric, Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge. https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu Interesting! To summarize it, 25% performance degradation from best kernel (2.6.32) to worst (3.2.0), that is indeed significant. You might consider upgrading your kernel to 3.13 LTS. It's quite easy [...] There is other stuff running on the hardware that I do not wish to touch, so upgrading that particular host is currently not an option, otherwise I would have switched to trusty. Thanks for the pointer. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hi Fabien, Fabien COELHO schrieb am 19.02.2016 um 16:04: > >>> [...] Ubuntu 12.04 LTS (precise) >> >> That's with 12.04's standard kernel? > > Yes. Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge. https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally: https://wiki.ubuntu.com/Kernel/LTSEnablementStack /Patric -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hello. Based on these results I think 32 will be a good default for checkpoint_flush_after? There's a few cases where 64 showed to be beneficial, and some where 32 is better. I've seen 64 perform a bit better in some cases here, but the differences were not too big. Yes, these many runs show that 32 is basically as good or better than 64. I'll do some runs with 16/48 to have some more data. I gather that you didn't play with backend_flush_after/bgwriter_flush_after, i.e. you left them at their default values? Especially backend_flush_after can have a significant positive and negative performance impact. Indeed, non reported configuration options have their default values. There were also minor changes in the default options for logging (prefix, checkpoint, ...), but nothing significant, and always the same for all runs. [...] Ubuntu 12.04 LTS (precise) That's with 12.04's standard kernel? Yes. checkpoint_flush_after = { none, 0, 32, 64 } Did you re-initdb between the runs? Yes, all runs are from scratch (initdb, pgbench -i, some warmup...). I've seen massively varying performance differences due to autovacuum triggered analyzes. It's not completely deterministic when those run, and on bigger scale clusters analyze can take ages, while holding a snapshot. Yes, I agree that probably the performance changes on long vs short runs (andres00c vs andres00b) is due to autovacuum. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hi, On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote: > Below the results of a lot of tests with pgbench to exercise checkpoints on > the above version when fetched. Wow, that's a great test series. > Overall comments: > - sorting & flushing is basically always a winner > - benchmarking with short runs on large databases is a bad idea >the results are very different if a longer run is used >(see andres00b vs andres00c) Based on these results I think 32 will be a good default for checkpoint_flush_after? There's a few cases where 64 showed to be beneficial, and some where 32 is better. I've seen 64 perform a bit better in some cases here, but the differences were not too big. I gather that you didn't play with backend_flush_after/bgwriter_flush_after, i.e. you left them at their default values? Especially backend_flush_after can have a significant positive and negative performance impact. > 16 GB 2 cpu 8 cores > 200 GB RAID1 HDD, ext4 FS > Ubuntu 12.04 LTS (precise) That's with 12.04's standard kernel? > postgresql.conf: >shared_buffers = 1GB >max_wal_size = 1GB >checkpoint_timeout = 300s >checkpoint_completion_target = 0.8 >checkpoint_flush_after = { none, 0, 32, 64 } Did you re-initdb between the runs? I've seen massively varying performance differences due to autovacuum triggered analyzes. It's not completely deterministic when those run, and on bigger scale clusters analyze can take ages, while holding a snapshot. > Hmmm, interesting: maintenance_work_mem seems to have some influence on > performance, although it is not too consistent between settings, probably > because as the memory is used to its limit the performance is quite > sensitive to the available memory. That's probably because of differing behaviour of autovacuum/vacuum, which sometime will have to do several scans of the tables if there are too many dead tuples. 
Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below the results of a lot of tests with pgbench to exercise checkpoints on the above version when fetched.

Overall comments:
- sorting & flushing is basically always a winner
- benchmarking with short runs on large databases is a bad idea: the results are very different if a longer run is used (see andres00b vs andres00c)

# HOST/SOFT

16 GB 2 cpu 8 cores
200 GB RAID1 HDD, ext4 FS
Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

tps: the "excluding connection" time tps, the higher the better

1-sec tps: average of the measured per-second tps. Note: it should be the same as the previous one, but due to various hazards in the trace, especially when things go badly and pg gets stuck, it may differ. Such hazards also explain why non-integer tps are reported for some seconds.
stddev: standard deviation, the lower the better. The five figures in brackets give a feel of the distribution:
- min: minimal per-second tps seen in the trace
- q1: first-quartile per-second tps seen in the trace
- med: median per-second tps seen in the trace
- q3: third-quartile per-second tps seen in the trace
- max: maximal per-second tps seen in the trace

The last percentage, dubbed "<=10.0", is the percentage of seconds where performance was below 10 tps: this measures how unresponsive pg was during the run.

## TINY2

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 with scale = 10 (~ 200 MB)

postgresql.conf:
  shared_buffers = 1GB
  max_wal_size = 1GB
  checkpoint_timeout = 300s
  checkpoint_completion_target = 0.8
  checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [  min      q1    med     q3     max ] <=10.0
head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
     1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
     2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%
   0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
     1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
     2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%
  32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
     1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
     2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%
  64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
     1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
     2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?

Pretty small setup, all data fits in buffers. Good tps performance all around (best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction in tps stddev.
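For reference, the bracketed distribution figures and the "<=10.0" responsiveness metric above can be computed from a pgbench --progress trace along these lines (a minimal sketch: parsing pgbench's per-second progress lines is assumed to have already produced a list of tps samples, and the crude index-based quantiles do no interpolation):

```python
import statistics

def tps_summary(tps_per_sec, stall_threshold=10.0):
    """Summarize a list of per-second tps samples as reported above."""
    xs = sorted(tps_per_sec)
    n = len(xs)
    q = lambda p: xs[min(n - 1, int(p * n))]  # crude quantile, no interpolation
    return {
        "avg": sum(xs) / n,
        "stddev": statistics.pstdev(xs),
        "min": xs[0],
        "q1": q(0.25),
        "med": q(0.50),
        "q3": q(0.75),
        "max": xs[-1],
        # percent of seconds where pg was essentially unresponsive
        "<=10.0": 100.0 * sum(1 for x in xs if x <= stall_threshold) / n,
    }
```

A run where flushing smooths out checkpoint stalls shows up here as a higher min/q1 and a "<=10.0" figure near zero, even when the average tps barely moves.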
## SMALL

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 with scale = 120 (~ 2 GB)

postgresql.conf:
  shared_buffers = 2GB
  checkpoint_timeout = 300s
  checkpoint_completion_target = 0.8
  checkpoint_flush_after = { none, 0, 32, 64 }

opts # | tps / 1-sec tps ± stddev [ min     q1    med      q3     max ] <=10.0
head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
     1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
     2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
     3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
     4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%
   0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
     1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
     2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
     3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
     4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%
  32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1] 0.2%
     1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0] 0.1%
     2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0] 0.2%
     3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1] 0.1%
     4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1] 0.1%
  64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0] 0.1%
     1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0] 0.1%
     2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0] 0.1%
     3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9] 1.5% ??
     4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0] 0.1%

Sorting brings a ~3.3x tps improvement; flushing significantly reduces tps stddev
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bunch of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
>
> When reading the patch 0002 explanations, I had the following comments:
>
> AFAICS, there are several levels of actions when writing things in pg:
>
> 0: the thing is written in some internal buffer
>
> 1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like 'this transaction committed' or 'all tuples on this page are visible'.

> 2: the buffer is actually passed to the OS (write, flush)
>
> 3: the OS is advised to send the written data to the io subsystem
> (sync_file_range with SYNC_FILE_RANGE_WRITE)
>
> 4: the OS is required to send the written data to the disk
> (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) - the guarantees it gives aren't well defined, and actually changed across releases.

0002 is about something different: the WAL writer, which writes WAL to disk so that individual backends don't have to. It does so in the background every wal_writer_delay, or whenever a transaction asynchronously commits. The reason this interacts with checkpoint flushing is that, when we flush writes at a regular pace, the writes by the checkpointer happen in between the very frequent writes/fdatasync()s by the WAL writer. That means the disk's caches are flushed on every fdatasync() - which causes considerable slowdowns. On a decent SSD the WAL writer, before this patch, often did 500-1000 fdatasync()s a second; the regular sync_file_range calls slowed things down too much. That's what caused the large regression when using checkpoint sorting/flushing with synchronous_commit=off.
With that fixed - often a performance improvement on its own - I don't see that regression anymore.

> After more consideration, my final understanding is that this behavior only
> occurs with "asynchronous commit", i.e. a situation where COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.

> Now all this is somehow alien to me because the whole point of committing is
> having the data on disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable to every scenario, but in a *lot* of real-world scenarios a sub-second loss window doesn't have any actual negative implications.

Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-11 19:44:25 +0100, Andres Freund wrote: > The first two commits of the series are pretty close to being ready. I'd > welcome review of those, and I plan to commit them independently of the > rest as they're beneficial independently. The most important bits are > the comments and docs of 0002 - they weren't particularly good > beforehand, so I had to rewrite a fair bit. > > 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the > potential regressions of 0002 > 0002: Fix the overaggressive flushing by the wal writer, by only > flushing every wal_writer_delay ms or wal_writer_flush_after > bytes. I've pushed these after some more polishing, now working on the next two. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
> potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
> flushing every wal_writer_delay ms or wal_writer_flush_after bytes.

I've looked at these patches, especially the whole bunch of explanations and comments, which is a good source for understanding what is going on in the WAL writer, a part of pg I'm not familiar with.

When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:

0: the thing is written in some internal buffer

1: the buffer is advised to be passed to the OS (hint bits?)

2: the buffer is actually passed to the OS (write, flush)

3: the OS is advised to send the written data to the io subsystem (sync_file_range with SYNC_FILE_RANGE_WRITE)

4: the OS is required to send the written data to the disk (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

It is not clear when reading the text which level is discussed. In particular, I'm not sure that "flush" refers to level 2, which is misleading. When reading the description, I'm rather under the impression that it is about level 4, but then if actual fsyncs were performed every 200 ms the tps would be very low...

After more consideration, my final understanding is that this behavior only occurs with "asynchronous commit", i.e. a situation where COMMIT does not wait for data to be really fsynced, but the fsync is to occur within some delay so it will not be too far away, some kind of compromise for performance where commits can be lost.

Now all this is somehow alien to me because the whole point of committing is having the data on disk, and I would not consider a database to be safe if commit does not imply fsync, but I understand that people may have to compromise for performance.

Is my understanding right?

-- Fabien.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently. The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
> potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
> flushing every wal_writer_delay ms or wal_writer_flush_after bytes.

I previously reviewed 0001 and I think it's fine. I haven't reviewed 0002 in detail, but I like the concept.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-04 16:54:58 +0100, Andres Freund wrote: > Fabien asked me to post a new version of the checkpoint flushing patch > series. While this isn't entirely ready for commit, I think we're > getting closer. > > I don't want to post a full series right now, but my working state is > available on > http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush > git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush The first two commits of the series are pretty close to being ready. I'd welcome review of those, and I plan to commit them independently of the rest as they're beneficial independently. The most important bits are the comments and docs of 0002 - they weren't particularly good beforehand, so I had to rewrite a fair bit. 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the potential regressions of 0002 0002: Fix the overaggressive flushing by the wal writer, by only flushing every wal_writer_delay ms or wal_writer_flush_after bytes. Greetings, Andres Freund >From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001 From: Andres FreundDate: Thu, 11 Feb 2016 19:34:29 +0100 Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is new enough. Previously we only allowed SetHintBits() to succeed if the commit LSN of the last transaction touching the page has already been flushed to disk. We can't generally change the LSN of the page, because we don't necessarily have the required locks on the page. But the required LSN interlock does not require the commit record to be flushed, it just requires that the commit record will be flushed before the page is written out. Therefore if the buffer LSN is newer than the commit LSN, the hint bit can be safely set. In a number of scenarios (e.g. pgbench) this noticeably increases the number of hint bits are set. But more importantly it also keeps the success rate up when flushing WAL less frequently. 
That was the original reason for commit 4de82f7d7, which has negative performance consequences in a number of scenarios. This will allow a follup commit to reduce the flush rate. Discussion: 20160118163908.gw10...@awork2.anarazel.de --- src/backend/utils/time/tqual.c | 21 + 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index 465933d..503bd1d 100644 --- a/src/backend/utils/time/tqual.c +++ b/src/backend/utils/time/tqual.c @@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot); * Set commit/abort hint bits on a tuple, if appropriate at this time. * * It is only safe to set a transaction-committed hint bit if we know the - * transaction's commit record has been flushed to disk, or if the table is - * temporary or unlogged and will be obliterated by a crash anyway. We - * cannot change the LSN of the page here because we may hold only a share - * lock on the buffer, so we can't use the LSN to interlock this; we have to - * just refrain from setting the hint bit until some future re-examination - * of the tuple. + * transaction's commit record is guaranteed to be flushed to disk before the + * buffer, or if the table is temporary or unlogged and will be obliterated by + * a crash anyway. We cannot change the LSN of the page here because we may + * hold only a share lock on the buffer, so we can only use the LSN to + * interlock this if the buffer's LSN already is newer than the commit LSN; + * otherwise we have to just refrain from setting the hint bit until some + * future re-examination of the tuple. * * We can always set hint bits when marking a transaction aborted. (Some * code in heapam.c relies on that!) @@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer, /* NB: xid must be known committed here! 
*/ XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid); - if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer)) - return;/* not flushed yet, so don't set hint */ + if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) && + BufferGetLSNAtomic(buffer) < commitLSN) + { + /* not flushed and no LSN interlock, so don't set hint */ + return; + } } tuple->t_infomask |= infomask; -- 2.7.0.229.g701fa7f >From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 11 Feb 2016 19:34:29 +0100 Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate. Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the likelihood that hint bits can be set quickly. More quickly set hint bits can reduce contention around the clog et al. But unfortunately the increased flush rate can have a significant negative performance impact, I have measured up to a factor of ~4. The reason for this slowdown is that if there are independent
Re: [HACKERS] checkpointer continuous flushing - V16
>> I think I would appreciate comments to understand why/how the
>> ringbuffer is used, and more comments in general, so it is fine if you
>> improve this part.
>
> I'd suggest to leave out the ringbuffer/new bgwriter parts.

Ok, so the patch would only include the checkpointer stuff. I'll look at this part in detail.

-- Fabien.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO wrote:
>>> I think I would appreciate comments to understand why/how the
>>> ringbuffer is used, and more comments in general, so it is fine if you
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
> Ok, so the patch would only include the checkpointer stuff.
>
> I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new bgwriter, but it's not really worthwhile until sorting and flushing is in...

Andres

--- Please excuse brevity and formatting - I am writing this on my mobile phone.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>    changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>    tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>    by individual GUCs. Without that I still saw the spikes in a lot of
>    circumstances.
>
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this? I'm currently updating docs and comments to actually describe the current state...

Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> Any comments before I spend more time polishing this?

I'm running tests on various settings, I'll send a report when it is done. Up to now the performance seems as good as with the previous version.

> I'm currently updating docs and comments to actually describe the current state...

I did notice the mismatched documentation. I think I would appreciate comments to understand why/how the ringbuffer is used, and more comments in general, so it is fine if you improve this part.

Minor details: "typedefs.list" should be updated to include WritebackContext. "WritebackContext" is a typedef, so "struct" is not needed.

I'll look at the code more deeply, probably over next weekend.

-- Fabien.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote: > I think I would appreciate comments to understand why/how the ringbuffer is > used, and more comments in general, so it is fine if you improve this part. I'd suggest to leave out the ringbuffer/new bgwriter parts. I think they'd be committed separately, and probably not in 9.6. Thanks, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch series. While this isn't entirely ready for commit, I think we're getting closer.

I don't want to post a full series right now, but my working state is available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by changing the wal writer flushing logic
2) the flushing API moved up a couple layers, and now deals with buffer tags, rather than the physical files
3) writes from checkpoints, bgwriter and files are flushed, configurable by individual GUCs. Without that I still saw the spikes in a lot of circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm not sure it's realistic to polish that up within the constraints of 9.6.

Regards, Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
This patch got its fair share of reviewer attention this commitfest. Moving to the next one. Andres, if you want to commit ahead of time you're of course encouraged to do so.

-- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
On Wed, Jan 20, 2016 at 9:02 AM, Andres Freund wrote:
> Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
> SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
> because we can't easily set the LSN. But, it's actually fairly common
> that the page's LSN is already newer than the commitLSN - in which case
> we, afaics, just can go ahead and set the hint bit, no?
>
> So, instead of
>     if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
>         return;   /* not flushed yet, so don't set hint */
> we do
>     if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
>         && BufferGetLSNAtomic(buffer) < commitLSN)
>         return;   /* not flushed yet, so don't set hint */
>
> In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
> a large portion of the hint writes that we currently skip.

Dang. That's a really good idea. Although I think you'd probably better revise the comment, since it will otherwise be false.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
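To see what the reordering buys, the two conditions can be modeled abstractly (a sketch only, not the actual C code: LSNs are reduced to plain integers, and the function names are invented for illustration):

```python
def skip_hint_old(permanent: bool, commit_lsn_needs_flush: bool) -> bool:
    """Pre-change rule: never set the hint bit while the commit record
    is unflushed, regardless of the buffer's own LSN."""
    return commit_lsn_needs_flush and permanent

def skip_hint_new(permanent: bool, commit_lsn_needs_flush: bool,
                  buffer_lsn: int, commit_lsn: int) -> bool:
    """Post-change rule: an unflushed commit record is fine as long as the
    buffer's LSN is already >= the commit LSN, because the buffer then
    cannot be written out before that much WAL has been flushed."""
    return (permanent and commit_lsn_needs_flush
            and buffer_lsn < commit_lsn)
```

For example, with an unflushed commit record at LSN 50 and a buffer LSN of 100, the old rule skips setting the hint bit while the new rule allows it - which is exactly the common case the message above describes.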
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
> On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund wrote:
> > I don't think it's strongly related - the contention here is on read
> > access to the clog, not on write access.
>
> Aren't reads on clog contended with parallel writes to clog?

Sure. But you're not going to beat "no access to the clog" due to hint bits by making parallel writes a bit better citizens.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund wrote:
> On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > The relevant thread is at
> > > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > > what I didn't remember is that I voiced concern back then about exactly this:
> > > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > > ;)
> >
> > Interesting. If we consider for a minute that part of the cause for the
> > slowdown is slowness in pg_clog, maybe we should reconsider the initial
> > decision to flush as quickly as possible (i.e. adopt a strategy where
> > walwriter sleeps a bit between two flushes) in light of the group-update
> > feature for CLOG being proposed by Amit Kapila in another thread -- it
> > seems that these things might go hand-in-hand.
>
> I don't think it's strongly related - the contention here is on read
> access to the clog, not on write access.

Aren't reads on clog contended with parallel writes to clog?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] checkpointer continuous flushing
Andres Freund wrote:
> The relevant thread is at
> http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> what I didn't remember is that I voiced concern back then about exactly this:
> http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> ;)

Interesting. If we consider for a minute that part of the cause for the slowdown is slowness in pg_clog, maybe we should reconsider the initial decision to flush as quickly as possible (i.e. adopt a strategy where walwriter sleeps a bit between two flushes) in light of the group-update feature for CLOG being proposed by Amit Kapila in another thread -- it seems that these things might go hand-in-hand.

-- Álvaro Herrera    http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > The relevant thread is at > > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com > > what I didn't remember is that I voiced concern back then about exactly > > this: > > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de > > ;) > > Interesting. If we consider for a minute that part of the cause for the > slowdown is slowness in pg_clog, maybe we should reconsider the initial > decision to flush as quickly as possible (i.e. adopt a strategy where > walwriter sleeps a bit between two flushes) in light of the group-update > feature for CLOG being proposed by Amit Kapila in another thread -- it > seems that these things might go hand-in-hand. I don't think it's strongly related - the contention here is on read access to the clog, not on write access. While Amit's patch will reduce the impact of that a bit, I don't see it making a fundamental difference. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else. It seems likely to be inadvertent fallout from this
> > patch:
> >
> > Author: Simon Riggs
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +
> >
> > Wakeup WALWriter as needed for asynchronous commit performance.
> > Previously we waited for wal_writer_delay before flushing WAL. Now
> > we also wake WALWriter as soon as a WAL buffer page has filled.
> > Significant effect observed on performance of asynchronous commits
> > by Robert Haas, attributed to the ability to set hint bits on tuples
> > earlier and so reducing contention caused by clog lookups.
>
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
What I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit.

Quick summary: the frequent wakeups of the wal writer can lead to significant performance regressions in workloads that are bigger than shared_buffers, because the super-frequent fdatasync()s by the wal writer slow down concurrent writes (bgwriter, checkpointer, individual backend writes) dramatically. To the point that SIGSTOPing the wal writer gets a pgbench workload from 2995 to 10887 tps. The reason fdatasyncs cause a slowdown is that they prevent real use of queuing in the storage devices.
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings). But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay. This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next. That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent. I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
>
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in a too granular fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once exceeded, a
> minimum number of pages per flush. I think we even could continue to
> write() the data at the same rate as today, we just would need to reduce
> the number of fdatasync()s we issue. And possibly could make the
> eventual fdatasync()s cheaper by hinting the kernel to write them out
> earlier.
>
> Now the question what the minimum number of pages we want to flush for
> (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.

This unfortunately leaves out part of the reasoning for the above commit: We want WAL to be flushed fast, so we can immediately set hint bits.

One, relatively extreme, approach would be to continue *writing* WAL in the background writer as today, but use rules like those suggested above to guide the actual flushing, additionally using operations like sync_file_range() (and equivalents on other OSs). Then, to address the regression of SetHintBits() having to bail out more often, actually trigger a WAL flush whenever WAL is already written, but not flushed. That has the potential to be bad in a number of other cases tho :(

Andres

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
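The batching rule sketched above (a minimum flush rate tied to wal_writer_delay, plus a minimum batch of roughly 10% of wal_buffers before an extra flush is allowed) can be expressed as a small decision function. The following is an illustrative sketch only, not PostgreSQL source; the function name, parameters, and units are all hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the flush-batching rule discussed above:
 * flush when wal_writer_delay has elapsed since the last flush (the
 * minimum rate under activity), or when at least 10% of the WAL
 * buffers have been written since the last fdatasync() (the minimum
 * batch size).  All names here are illustrative, not PostgreSQL's.
 */
bool
walwriter_should_flush(int pages_written_since_flush,
                       int xlog_buffers,
                       int ms_since_last_flush,
                       int wal_writer_delay_ms)
{
    /* Minimum flush rate under activity: honor wal_writer_delay. */
    if (pages_written_since_flush > 0 &&
        ms_since_last_flush >= wal_writer_delay_ms)
        return true;

    /* Otherwise require a minimum batch: >= 10% of wal_buffers. */
    return pages_written_since_flush >= xlog_buffers / 10;
}
```

With 512 WAL buffers, a single dirty page no longer triggers an immediate fdatasync(); it either waits out wal_writer_delay or rides along once ~51 pages have accumulated.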
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in too granular a fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we could even continue to
> > write() the data at the same rate as today; we would just need to reduce
> > the number of fdatasync()s we issue. And possibly could make the
> > eventual fdatasync()s cheaper by hinting the kernel to write them out
> > earlier.
> >
> > Now the question of what the minimum number of pages we want to flush
> > should be (setting wal_writer_delay-triggered ones aside) isn't easy to
> > answer. A simple model would be to statically tie it to the size of
> > wal_buffers; say, don't flush unless at least 10% of XLogBuffers have
> > been written since the last flush. More complex approaches would be to
> > measure the continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we can immediately set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like those suggested above
> to guide the actual flushing. Additionally using operations like
> sync_file_range() (and equivalents on other OSs).
> Then, to address the regression of SetHintBits() having to bail out
> more often, actually trigger a WAL flush whenever WAL is already
> written, but not flushed. That has the potential to be bad in a number
> of other cases tho :(

Chatting on IM with Heikki, I noticed that we're pretty pessimistic in SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN), because we can't easily set the LSN. But it's actually fairly common that the page's LSN is already newer than the commitLSN - in which case we, afaics, can just go ahead and set the hint bit, no?

So, instead of

if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
    return;   /* not flushed yet, so don't set hint */

we do

if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
    BufferGetLSNAtomic(buffer) < commitLSN)
    return;   /* not flushed yet, so don't set hint */

In my tests with pgbench -s 100 and 2GB of shared buffers, that recovers a large portion of the hint writes that we currently skip. Right now, on my laptop, I get (-M prepared -c 32 -j 32):

current wal writer: 12827 tps, 95 % IO util, 93 % CPU
no flushing in wal writer *: 13185 tps, 46 % IO util, 93 % CPU
no flushing in wal writer & above change: 16366 tps, 41 % IO util, 95 % CPU
flushing in wal writer & above change: 14812 tps, 94 % IO util, 95 % CPU

* sometimes the results initially were much lower, with lots of lock contention. Can't figure out why that's only sometimes the case. In those cases the results were more like 8967 tps.

These aren't meant as thorough benchmarks, just to provide some orientation.

Now, that solution won't improve every situation; e.g. for a workload that inserts a lot of rows in one transaction, and only does inserts, it probably won't do all that much. But it still seems like a pretty good mitigation strategy. I hope that with a smarter write strategy (getting that 50% reduction in IO util) and the above we should be ok.
Andres
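The effect of the relaxed SetHintBits() test above can be modeled with plain integers standing in for LSNs. This is a toy model, not PostgreSQL source: `flushed_upto` plays the role of XLogNeedsFlush()'s internal state, and `page_lsn` the result of BufferGetLSNAtomic(); all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's LSN type */

/* Model of XLogNeedsFlush(): the commit record is not yet durable. */
static bool
needs_flush(XLogRecPtr commit_lsn, XLogRecPtr flushed_upto)
{
    return commit_lsn > flushed_upto;
}

/* Old rule: skip the hint bit whenever the commit record isn't flushed. */
bool
skip_hint_old(XLogRecPtr commit_lsn, XLogRecPtr flushed_upto)
{
    return needs_flush(commit_lsn, flushed_upto);
}

/*
 * New rule: additionally allow the hint bit when the page's LSN already
 * covers the commit LSN - writing that page out would force at least that
 * much WAL to disk first anyway.
 */
bool
skip_hint_new(XLogRecPtr commit_lsn, XLogRecPtr flushed_upto,
              XLogRecPtr page_lsn)
{
    return needs_flush(commit_lsn, flushed_upto) && page_lsn < commit_lsn;
}
```

The third case below is the one the change recovers: commit not yet flushed, but the page LSN already past it, so the hint bit can be set anyway.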
Re: [HACKERS] checkpointer continuous flushing
I measured it in a different number of cases, both on SSDs and spinning rust. I just reproduced it with:

postgres-ckpt14 \
  -D /srv/temp/pgdev-dev-800/ \
  -c maintenance_work_mem=2GB \
  -c fsync=on \
  -c synchronous_commit=off \
  -c shared_buffers=2GB \
  -c wal_level=hot_standby \
  -c max_wal_senders=10 \
  -c max_wal_size=100GB \
  -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time) and using pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I must say that I have not succeeded in reproducing any significant regression up to now on an HDD. I'm running some more tests again because I had left out some options above that I thought were non-essential.

I have deep problems with the 30-second checkpoint tests: basically the checkpoints take much more than 30 seconds to complete, the system is not stable, and the 300-second runs last more than 900 seconds because the clients are stuck for a long time. The overall behavior is appalling, as most of the time is spent in IO panic at 0 tps.

Also, the performance level is around 160 tps on HDDs, which makes sense to me for a 7200 rpm HDD capable of about x00 random writes per second. It seems to me that you reported much better performance on HDD, but I cannot really see how this would be possible if data are indeed written to disk. Any idea? Also, what is the very precise postgres version & patch used in your tests on HDDs?

> They're also similar (although obviously both before/after patch are
> higher) if I disable full_page_writes, thereby eliminating a lot of
> other IO.

Maybe this is an explanation.

-- Fabien.
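Fabien's ~160 tps figure is consistent with the standard back-of-envelope service-time model for a spinning disk: each random write pays an average seek plus half a rotation. The helper below is purely illustrative (not from the thread), and the seek times are assumptions:

```c
#include <assert.h>

/*
 * Back-of-envelope estimate (illustrative only) of random-write IOPS
 * for a spinning disk: average seek time plus half a rotation per
 * request, ignoring queuing, caching and track layout.
 */
double
hdd_random_iops(double rpm, double avg_seek_ms)
{
    double half_rotation_ms = 0.5 * 60000.0 / rpm;  /* 60,000 ms/minute */
    return 1000.0 / (avg_seek_ms + half_rotation_ms);
}
```

For a 7200 rpm drive, half a rotation is ~4.2 ms, so with short seeks the ceiling is on the order of 100-150 random writes per second - i.e. roughly the "x00" range and the observed ~160 tps once each transaction costs about one random page write.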
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-19 10:27:31 +0100, Fabien COELHO wrote:
> Also, the performance level is around 160 tps on HDDs, which makes sense to
> me for a 7200 rpm HDD capable of about x00 random writes per second. It
> seems to me that you reported much better performance on HDD, but I cannot
> really see how this would be possible if data are indeed written to disk. Any
> idea?

synchronous_commit = off does make a significant difference.
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> This seems like a problem with the WAL writer quite independent of
> anything else. It seems likely to be inadvertent fallout from this
> patch:
>
> Author: Simon Riggs> Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +
>
> Wakeup WALWriter as needed for asynchronous commit performance.
> Previously we waited for wal_writer_delay before flushing WAL. Now
> we also wake WALWriter as soon as a WAL buffer page has filled.
> Significant effect observed on performance of asynchronous commits
> by Robert Haas, attributed to the ability to set hint bits on tuples
> earlier and so reducing contention caused by clog lookups.

In addition to that the "powersaving" effort also plays a role - without the latch we'd not wake up at any meaningful rate at all atm.

> If I understand correctly, prior to that commit, WAL writer woke up 5
> times per second and flushed just that often (unless you changed the
> default settings). But as the commit message explained, that turned
> out to suck - you could make performance go up very significantly by
> radically decreasing wal_writer_delay. This commit basically lets it
> flush at maximum velocity - as fast as we finish one flush, we can
> start the next. That must have seemed like a win at the time from the
> way the commit message was written, but you seem to now be seeing the
> opposite effect, where performance is suffering because flushes are
> too frequent rather than too infrequent. I wonder if there's an ideal
> flush rate and what it is, and how much it depends on what hardware
> you have got.

I think the problem isn't really that it's flushing too much WAL in total, it's that it's flushing WAL in too granular a fashion. I suspect we want something where we attempt a minimum number of flushes per second (presumably tied to wal_writer_delay) and, once exceeded, a minimum number of pages per flush.
I think we could even continue to write() the data at the same rate as today; we would just need to reduce the number of fdatasync()s we issue. And possibly could make the eventual fdatasync()s cheaper by hinting the kernel to write them out earlier.

Now the question of what the minimum number of pages we want to flush should be (setting wal_writer_delay-triggered ones aside) isn't easy to answer. A simple model would be to statically tie it to the size of wal_buffers; say, don't flush unless at least 10% of XLogBuffers have been written since the last flush. More complex approaches would be to measure the continuous WAL writeout rate.

By tying it to both a minimum rate under activity (ensuring things go to disk fast) and a minimum number of pages to sync (ensuring a reasonable number of cache flush operations) we should be able to mostly accommodate the different types of workloads. I think.

Andres
Re: [HACKERS] checkpointer continuous flushing
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a different number of cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> > -D /srv/temp/pgdev-dev-800/ \
>> > -c maintenance_work_mem=2GB \
>> > -c fsync=on \
>> > -c synchronous_commit=off \
>> > -c shared_buffers=2GB \
>> > -c wal_level=hot_standby \
>> > -c max_wal_senders=10 \
>> > -c max_wal_size=100GB \
>> > -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed: 2995 tps
> After I kill the wal writer with -STOP: 10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3) = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3) = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0"..., 98304) = 98304
> 17:29:02.008822 fdatasync(3) = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3) = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0"..., 73728) = 73728
> 17:29:02.019355 fdatasync(3) = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small numbers of pages, roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.

This seems like a problem with the WAL writer quite independent of anything else. It seems likely to be inadvertent fallout from this patch:

Author: Simon Riggs
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +

Wakeup WALWriter as needed for asynchronous commit performance. Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.

If I understand correctly, prior to that commit, WAL writer woke up 5 times per second and flushed just that often (unless you changed the default settings). But as the commit message explained, that turned out to suck - you could make performance go up very significantly by radically decreasing wal_writer_delay. This commit basically lets it flush at maximum velocity - as fast as we finish one flush, we can start the next. That must have seemed like a win at the time from the way the commit message was written, but you seem to now be seeing the opposite effect, where performance is suffering because flushes are too frequent rather than too infrequent.
Re: [HACKERS] checkpointer continuous flushing
synchronous_commit = off does make a significant difference.

Sure, but I had thought about that and kept this one...

But why are you then saying this is fundamentally limited to 160 xacts/sec?

I'm just saying that the tested load generates mostly random IOs (probably on average over 1 page per transaction), and random IOs are very slow on an HDD, so I do not expect great tps.

I think I found one possible culprit: I automatically wrote 300 seconds for checkpoint_timeout, instead of 30 seconds in your settings. I'll have to rerun the tests with this (unreasonable) figure to check whether I really get a regression.

I've not seen meaningful changes in the size of the regression between 30/300s.

At 300 seconds (5 minutes) the checkpoints of the accumulated writes take 15-25 minutes, during which the database is mostly offline, and there is no clear difference with/without sort+flush. Other tests I ran with "reasonable" settings on a large (scale=800) db did not show any significant performance regression, up to now.

Try running it so that the data set nearly, but not entirely, fits into the OS page cache, while definitely not fitting into shared_buffers. The scale=800 just worked for that on my hardware, no idea how it is for yours. That seems to be the point where the effect is the worst.

I have 16GB memory on the tested host, same as your hardware I think, so I use scale 800 => 12GB at the beginning of the run. Not sure it fits the bill, as I think it fits in memory, so the load is mostly write and no/very few reads. I'll also try with scale 1000.

-- Fabien.
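The "scale 800 => 12GB" figure follows from pgbench's usual rule of thumb of roughly 15 MB of on-disk data per scale unit (dominated by the 100,000-row pgbench_accounts table each unit adds). The helper below is an illustrative approximation, not an exact pgbench formula:

```c
#include <assert.h>

/*
 * Rough rule of thumb (approximation, not an exact pgbench formula):
 * a freshly initialized pgbench database occupies on the order of
 * 15 MB per scale unit, dominated by the 100,000 accounts rows that
 * each scale unit adds.
 */
double
pgbench_size_gb(int scale)
{
    return scale * 15.0 / 1024.0;   /* MB -> GB */
}
```

So scale 800 lands just under 12 GB, which on a 16 GB host fits in the OS page cache but not in 2 GB of shared_buffers, matching the regime Andres describes as worst-case.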
Re: [HACKERS] checkpointer continuous flushing
synchronous_commit = off does make a significant difference.

Sure, but I had thought about that and kept this one...

I think I found one possible culprit: I automatically wrote 300 seconds for checkpoint_timeout, instead of 30 seconds in your settings. I'll have to rerun the tests with this (unreasonable) figure to check whether I really get a regression.

Other tests I ran with "reasonable" settings on a large (scale=800) db did not show any significant performance regression, up to now.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-19 13:34:14 +0100, Fabien COELHO wrote:
> > > synchronous_commit = off does make a significant difference.
>
> Sure, but I had thought about that and kept this one...

But why are you then saying this is fundamentally limited to 160 xacts/sec?

> I think I found one possible culprit: I automatically wrote 300 seconds for
> checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
> rerun the tests with this (unreasonable) figure to check whether I really
> get a regression.

I've not seen meaningful changes in the size of the regression between 30/300s.

> Other tests I ran with "reasonable" settings on a large (scale=800) db did
> not show any significant performance regression, up to now.

Try running it so that the data set nearly, but not entirely, fits into the OS page cache, while definitely not fitting into shared_buffers. The scale=800 just worked for that on my hardware, no idea how it is for yours. That seems to be the point where the effect is the worst.
Re: [HACKERS] checkpointer continuous flushing
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>
> Hello Andres,
>
> >I measured it in a different number of cases, both on SSDs and spinning
> >rust. I just reproduced it with:
> >
> >postgres-ckpt14 \
> > -D /srv/temp/pgdev-dev-800/ \
> > -c maintenance_work_mem=2GB \
> > -c fsync=on \
> > -c synchronous_commit=off \
> > -c shared_buffers=2GB \
> > -c wal_level=hot_standby \
> > -c max_wal_senders=10 \
> > -c max_wal_size=100GB \
> > -c checkpoint_timeout=30s
> >
> >Using a fresh cluster each time (copied from a "template" to save time)
> >and using
> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1

So, I've analyzed the problem further, and I think I found something rather interesting. I'd profiled the kernel looking where it blocks in the IO request queues, and found that the wal writer was involved surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend writes) is flushed: 2995 tps
After I kill the wal writer with -STOP: 10887 tps

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3) = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3) = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0"..., 98304) = 98304
17:29:02.008822 fdatasync(3) = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3) = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0"..., 73728) = 73728
17:29:02.019355 fdatasync(3) = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small numbers of pages, roughly 500 times a second. As soon as the wal writer is stopped, it's much bigger chunks, on the order of 50-130 pages. And, not that surprisingly, that improves performance, because there are far fewer cache flushes submitted to the hardware.

> I'm running some tests similar to those above...
> Do you do some warmup when testing? I guess the answer is "no".

Doesn't make a difference here, I tried both. As long as before/after benchmarks start from the same state...

> I understand that you have 8 cores/16 threads on your host?

On one of them, 4 cores/8 threads on the laptop.

> Loading scale 800 data for 300 seconds tests takes much more than 300
> seconds (init takes ~360 seconds, vacuum & index are slow). With 30 seconds
> checkpoint cycles and without any warmup, I feel that these tests are really
> on the very short (too short) side, so I'm not sure how much I can trust
> such results as significant.
> The data I reported were with more real-life-like parameters.

I see exactly the same with 300s or 1000s checkpoint cycles, it just takes a lot longer to repeat. They're also similar (although obviously both before/after patch are higher) if I disable full_page_writes, thereby eliminating a lot of other IO.

Andres
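The write sizes in the strace excerpt above translate directly into WAL pages per fdatasync(), since XLOG_BLCKSZ defaults to 8 kB: the traced syncs cover only 1-12 pages each, at ~500 syncs per second, versus the 50-130 pages per sync observed once the wal writer is SIGSTOPed. A trivial illustrative helper:

```c
#include <assert.h>

/*
 * Convert a write() size from the strace output above into 8 kB WAL
 * pages (XLOG_BLCKSZ defaults to 8192 bytes).  Illustrative helper,
 * not PostgreSQL source.
 */
int
wal_pages(long bytes)
{
    return (int) (bytes / 8192);
}
```

Applied to the traced writes (49152, 8192, 98304, 57344, 73728 bytes), each cache-flush barrier is being paid for a mere 6, 1, 12, 7 and 9 pages of WAL, which is why batching the syncs helps so much.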
Re: [HACKERS] checkpointer continuous flushing
Hello Andres,

Hello Tomas. Ooops, sorry Andres, I mixed up the thread in my head, so it was not clear who was asking the questions to whom.

I was/am using ext4, and it turns out that, when enabling flushing, the results are hugely dependent on barriers=on/off, with the latter making flushing rather advantageous. Additionally data=ordered/writeback makes a measurable difference too.

These are very interesting tests, I'm looking forward to having a look at the results. The fact that these options change performance is expected.

Personally, the test I submitted on the thread used ext4 with default mount options plus "relatime".

I confirm that: nothing special but "relatime" on ext4 on my test host.

If I had a choice, I would tend to take the safest options, because the point of a database is to keep data safe. That's why I'm not fond of the "synchronous_commit=off" chosen above.

I confirm this opinion.

If you have a BBU on your disk/RAID system, playing with some of these options is probably safe, though.

Not the case with my basic hardware.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hello Andres,

I measured it in a different number of cases, both on SSDs and spinning rust. I just reproduced it with:

postgres-ckpt14 \
  -D /srv/temp/pgdev-dev-800/ \
  -c maintenance_work_mem=2GB \
  -c fsync=on \
  -c synchronous_commit=off \
  -c shared_buffers=2GB \
  -c wal_level=hot_standby \
  -c max_wal_senders=10 \
  -c max_wal_size=100GB \
  -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time) and using pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I'm running some tests similar to those above... Do you do some warmup when testing? I guess the answer is "no".

I understand that you have 8 cores/16 threads on your host?

Loading scale 800 data for 300-second tests takes much more than 300 seconds (init takes ~360 seconds, vacuum & index are slow). With 30-second checkpoint cycles and without any warmup, I feel that these tests are really on the very short (too short) side, so I'm not sure how much I can trust such results as significant. The data I reported were with more real-life-like parameters.

Anyway, I'll have some results to show with a setting more or less similar to yours.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing
Hi Fabien,

On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
>
> postgres-ckpt14 \
>  -D /srv/temp/pgdev-dev-800/ \
>  -c maintenance_work_mem=2GB \
>  -c fsync=on \
>  -c synchronous_commit=off \
>  -c shared_buffers=2GB \
>  -c wal_level=hot_standby \
>  -c max_wal_senders=10 \
>  -c max_wal_size=100GB \
>  -c checkpoint_timeout=30s

What kernel, filesystem and filesystem options did you measure with?

I was/am using ext4, and it turns out that, when enabling flushing, the results are hugely dependent on barriers=on/off, with the latter making flushing rather advantageous. Additionally data=ordered/writeback makes a measurable difference too.

Reading kernel sources trying to understand some more of the performance impact.

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing
Hi Fabien,

Hello Tomas.

On 2016-01-11 14:45:16 +0100, Andres Freund wrote:

I measured it in a different number of cases, both on SSDs and spinning rust. I just reproduced it with:

postgres-ckpt14 \
  -D /srv/temp/pgdev-dev-800/ \
  -c maintenance_work_mem=2GB \
  -c fsync=on \
  -c synchronous_commit=off \
  -c shared_buffers=2GB \
  -c wal_level=hot_standby \
  -c max_wal_senders=10 \
  -c max_wal_size=100GB \
  -c checkpoint_timeout=30s

What kernel, filesystem and filesystem options did you measure with?

Andres did these measurements, not me, so I do not know.

I was/am using ext4, and it turns out that, when enabling flushing, the results are hugely dependent on barriers=on/off, with the latter making flushing rather advantageous. Additionally data=ordered/writeback makes a measurable difference too.

These are very interesting tests, I'm looking forward to having a look at the results. The fact that these options change performance is expected.

Personally, the test I submitted on the thread used ext4 with default mount options plus "relatime". If I had a choice, I would tend to take the safest options, because the point of a database is to keep data safe. That's why I'm not fond of the "synchronous_commit=off" chosen above.

Reading kernel sources trying to understand some more of the performance impact.

Wow!

-- Fabien.