Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> >
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> >
> > The commits actually tested are
> >
> > cfafd8be (right before the first patch)
> > 7975c5e0 Allow the WAL writer to flush WAL at a reduced rate.
> > db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ...
>
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?

To clarify: the reason we'd not expect to see much difference here is that the above commits really only have any effect above noise if you use synchronous_commit=off. Without async commit it's just one additional gettimeofday() call and a few additional branches in the wal writer every wal_writer_delay.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote:
> 1) HP DL380 G5 (old rack server)
> - 2x Xeon E5450, 16GB RAM (8 cores)
> - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
> - RedHat 6
> - shared_buffers = 4GB
> - min_wal_size = 2GB
> - max_wal_size = 6GB
>
> 2) workstation with i5 CPU
> - 1x i5-2500k, 8GB RAM
> - 6x Intel S3700 100GB (in RAID0 for this benchmark)
> - Gentoo
> - shared_buffers = 2GB
> - min_wal_size = 1GB
> - max_wal_size = 8GB

Thinking about it, with that hardware I'm not surprised if you're only seeing small benefits. The amount of RAM limits the amount of dirty data, and you have plenty of on-storage buffering in comparison to that.

> Both machines were using the same kernel version 4.4.2 and default io
> scheduler (cfq).
>
> The test procedure was quite simple - pgbench with three different scales,
> for each scale three runs, 1h per run (and 30 minutes of warmup before each
> run).
>
> Due to the difference in amount of RAM, each machine used different scales -
> the goal is to have small, ~50% RAM, >200% RAM sizes:
>
> 1) Xeon: 100, 400, 6000
> 2) i5: 50, 200, 3000
>
> The commits actually tested are
>
> cfafd8be (right before the first patch)
> 7975c5e0 Allow the WAL writer to flush WAL at a reduced rate.
> db76b1ef Allow SetHintBits() to succeed if the buffer's LSN ...

Huh, now I'm a bit confused. These are the commits you tested? Those aren't the ones doing sorting and flushing?

> Also, I really wonder what will happen with non-default io schedulers. I
> believe all the testing so far was done with cfq, so what happens on
> machines that use e.g. "deadline" (as many DB machines actually do)?

deadline and noop showed slightly bigger benefits in my testing.

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Tomas,

> One of the goals of this thread (as I understand it) was to make the
> overall behavior smoother - eliminate sudden drops in transaction rate due
> to bursts of random I/O etc.
>
> One way to look at this is in terms of how much the tps fluctuates, so
> let's see some charts. I've collected per-second tps measurements (using
> the aggregation built into pgbench) but looking at that directly is pretty
> pointless because it's very difficult to compare two noisy lines jumping
> up and down.
>
> So instead let's see a CDF of the per-second tps measurements. I.e. we
> have 3600 tps measurements, and given a tps value the question is what
> percentage of the measurements is below this value:
>
>     y = Probability(tps <= x)
>
> We prefer higher values, and the ideal behavior would be that we get
> exactly the same tps every second. Thus an ideal CDF line would be a step
> line. Of course, that's rarely the case in practice. But comparing two CDF
> curves is easy - the line more to the right is better, at least for tps
> measurements, where we prefer higher values.

Very nice and interesting graphs!

Alas not easy to interpret for the HDD, as there are better/worse variations all along the distribution and the lines cross one another, so how it fares overall is unclear. Maybe a simple indication would be to compute the standard deviation on the per-second tps? The median may be interesting as well.

> I do have some more data, but those are the most interesting charts. The
> rest usually shows about the same thing (or nothing).
>
> Overall, I'm not quite sure the patches actually achieve the intended
> goals. On the 10k SAS drives I got better performance, but apparently much
> more variable behavior. On SSDs, I get a bit worse results.

Indeed.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO wrote:
>> Kernel 3.2 is extremely bad for PostgreSQL, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
> Interesting! To summarize it, 25% performance degradation from best kernel
> (2.6.32) to worst (3.2.0), that is indeed significant.

As far as I recall, the OS cache eviction is very aggressive in 3.2, so it would be possible that data from the FS cache that was just read could be evicted even if it was not used yet. This represents a large difference when the database does not fit in RAM.

--
Michael
Re: [HACKERS] checkpointer continuous flushing - V16
Hallo Patric,

> Kernel 3.2 is extremely bad for PostgreSQL, as the vm seems to amplify IO
> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>
> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

Interesting! To summarize it, 25% performance degradation from best kernel (2.6.32) to worst (3.2.0), that is indeed significant.

> You might consider upgrading your kernel to 3.13 LTS. It's quite easy
> [...]

There is other stuff running on the hardware that I do not wish to touch, so upgrading the particular host is currently not an option, otherwise I would have switched to trusty.

Thanks for the pointer.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
Hi Fabien,

Fabien COELHO wrote on 19.02.2016 at 16:04:
>>> [...] Ubuntu 12.04 LTS (precise)
>>
>> That's with 12.04's standard kernel?
>
> Yes.

Kernel 3.2 is extremely bad for PostgreSQL, as the vm seems to amplify IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.

https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

/Patric
Re: [HACKERS] checkpointer continuous flushing - V16
Hello.

> Based on these results I think 32 will be a good default for
> checkpoint_flush_after? There's a few cases where 64 showed to be
> beneficial, and some where 32 is better. I've seen 64 perform a bit better
> in some cases here, but the differences were not too big.

Yes, these many runs show that 32 is basically as good as or better than 64. I'll do some runs with 16/48 to have some more data.

> I gather that you didn't play with backend_flush_after/bgwriter_flush_after,
> i.e. you left them at their default values? Especially backend_flush_after
> can have a significant positive and negative performance impact.

Indeed, non-reported configuration options have their default values. There were also minor changes in the default options for logging (prefix, checkpoint, ...), but nothing significant, and always the same for all runs.

> > [...] Ubuntu 12.04 LTS (precise)
>
> That's with 12.04's standard kernel?

Yes.

> > checkpoint_flush_after = { none, 0, 32, 64 }
>
> Did you re-initdb between the runs?

Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).

> I've seen massively varying performance differences due to autovacuum
> triggered analyzes. It's not completely deterministic when those run, and
> on bigger scale clusters analyze can take ages, while holding a snapshot.

Yes, I agree that probably the performance changes on long vs short runs (andres00c vs andres00b) are due to autovacuum.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
Hi,

On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below the results of a lot of tests with pgbench to exercise checkpoints on
> the above version when fetched.

Wow, that's a great test series.

> Overall comments:
> - sorting & flushing is basically always a winner
> - benchmarking with short runs on large databases is a bad idea
>   the results are very different if a longer run is used
>   (see andres00b vs andres00c)

Based on these results I think 32 will be a good default for checkpoint_flush_after? There's a few cases where 64 showed to be beneficial, and some where 32 is better. I've seen 64 perform a bit better in some cases here, but the differences were not too big.

I gather that you didn't play with backend_flush_after/bgwriter_flush_after, i.e. you left them at their default values? Especially backend_flush_after can have a significant positive and negative performance impact.

> 16 GB 2 cpu 8 cores
> 200 GB RAID1 HDD, ext4 FS
> Ubuntu 12.04 LTS (precise)

That's with 12.04's standard kernel?

> postgresql.conf:
>   shared_buffers = 1GB
>   max_wal_size = 1GB
>   checkpoint_timeout = 300s
>   checkpoint_completion_target = 0.8
>   checkpoint_flush_after = { none, 0, 32, 64 }

Did you re-initdb between the runs? I've seen massively varying performance differences due to autovacuum triggered analyzes. It's not completely deterministic when those run, and on bigger scale clusters analyze can take ages, while holding a snapshot.

> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.

That's probably because of differing behaviour of autovacuum/vacuum, which sometimes will have to do several scans of the tables if there are too many dead tuples.
Regards,

Andres
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below are the results of a lot of tests with pgbench to exercise checkpoints on the above version when fetched.

Overall comments:
- sorting & flushing is basically always a winner
- benchmarking with short runs on large databases is a bad idea:
  the results are very different if a longer run is used
  (see andres00b vs andres00c)

# HOST/SOFT

16 GB 2 cpu 8 cores
200 GB RAID1 HDD, ext4 FS
Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS

tps: the "excluding connection" time tps, the higher the better

1-sec tps: average of measured per-second tps
  note - it should be the same as the previous one, but due to various
  hazards in the trace, especially when things go badly and pg gets stuck,
  it may be different. Such hazards also explain why some non-integer tps
  may be reported for some seconds.
stddev: standard deviation, the lower the better

the five figures in brackets give a feel of the distribution:
- min: minimal per-second tps seen in the trace
- q1: first quartile per-second tps seen in the trace
- med: median per-second tps seen in the trace
- q3: third quartile per-second tps seen in the trace
- max: maximal per-second tps seen in the trace

the last percentage, dubbed "<=10.0", is the percentage of seconds where performance is below 10 tps: this measures how unresponsive pg was during the run

## TINY2

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 with scale = 10 (~ 200 MB)

postgresql.conf:
  shared_buffers = 1GB
  max_wal_size = 1GB
  checkpoint_timeout = 300s
  checkpoint_completion_target = 0.8
  checkpoint_flush_after = { none, 0, 32, 64 }

opts # |    tps / 1-sec tps ± stddev [   min      q1     med      q3     max ] <=10.0
head 0 | 2574.1 / 2574.3 ± 367.4 [ 229.0, 2570.1, 2721.9, 2746.1, 2857.2]  0.0%
     1 | 2575.0 / 2575.1 ± 359.3 [   1.0, 2595.9, 2712.0, 2732.0, 2847.0]  0.1%
     2 | 2602.6 / 2602.7 ± 359.5 [  54.0, 2607.1, 2735.1, 2768.1, 2908.0]  0.0%
   0 0 | 2583.2 / 2583.7 ± 296.4 [ 164.0, 2580.0, 2690.0, 2717.1, 2833.8]  0.0%
     1 | 2596.6 / 2596.9 ± 307.4 [ 296.0, 2590.5, 2707.9, 2738.0, 2847.8]  0.0%
     2 | 2604.8 / 2605.0 ± 300.5 [ 110.9, 2619.1, 2712.4, 2738.1, 2849.1]  0.0%
  32 0 | 2625.5 / 2625.5 ± 250.5 [   1.0, 2645.9, 2692.0, 2719.9, 2839.0]  0.1%
     1 | 2630.2 / 2630.2 ± 243.1 [ 301.8, 2654.9, 2697.2, 2726.0, 2837.4]  0.0%
     2 | 2648.3 / 2648.4 ± 236.7 [ 570.1, 2664.4, 2708.9, 2739.0, 2844.9]  0.0%
  64 0 | 2587.8 / 2587.9 ± 306.1 [  83.0, 2610.1, 2680.0, 2731.0, 2857.1]  0.0%
     1 | 2591.1 / 2591.1 ± 305.2 [ 455.9, 2608.9, 2680.2, 2734.1, 2859.0]  0.0%
     2 | 2047.8 / 2046.4 ± 925.8 [   0.0, 1486.2, 2592.6, 2691.1, 3001.0]  0.2% ?

Pretty small setup, all data fit in buffers. Good tps performance all around (best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction in tps stddev.
## SMALL

pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4 with scale = 120 (~ 2 GB)

postgresql.conf:
  shared_buffers = 2GB
  checkpoint_timeout = 300s
  checkpoint_completion_target = 0.8
  checkpoint_flush_after = { none, 0, 32, 64 }

opts # |   tps / 1-sec tps ± stddev [ min     q1    med      q3     max ] <=10.0
head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
     1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
     2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
     3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
     4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%
   0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
     1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
     2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
     3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
     4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%
  32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
     1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
     2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
     3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
     4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%
  64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
     1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
     2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
     3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
     4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%

Sorting brings * 3.3 tps, flushing significantly reduces tps std
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bunch of explanations
> and comments which is a good source for understanding what is going on in
> the WAL writer, a part of pg I'm not familiar with.
>
> When reading the patch 0002 explanations, I had the following comments:
>
> AFAICS, there are several levels of actions when writing things in pg:
>
> 0: the thing is written in some internal buffer
>
> 1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like 'this transaction committed' or 'all tuples on this page are visible'.

> 2: the buffer is actually passed to the OS (write, flush)
>
> 3: the OS is advised to send the written data to the io subsystem
>    (sync_file_range with SYNC_FILE_RANGE_WRITE)
>
> 4: the OS is required to send the written data to the disk
>    (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) - the guarantees it gives aren't well defined, and actually changed across releases.

0002 is about something different: it's about the WAL writer, which writes WAL to disk so individual backends don't have to. It does so in the background every wal_writer_delay or whenever a transaction asynchronously commits.

The reason this interacts with checkpoint flushing is that, when we flush writes at a regular pace, the writes by the checkpointer happen in between the very frequent writes/fdatasync() by the WAL writer. That means the disk's caches are flushed every fdatasync() - which causes considerable slowdowns. On a decent SSD the WAL writer, before this patch, often did 500-1000 fdatasync()s a second; the regular sync_file_range calls slowed things down too much. That's what caused the large regression when using checkpoint sorting/flushing with synchronous_commit=off.
With that fixed - often a performance improvement on its own - I don't see that regression anymore.

> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.

> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable to every scenario, but in a *lot* of real-world scenarios a sub-second loss window doesn't have any actual negative implications.

Andres
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently. The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I've pushed these after some more polishing, now working on the next two.

Greetings,

Andres Freund
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I've looked at these patches, especially the whole bunch of explanations and comments which is a good source for understanding what is going on in the WAL writer, a part of pg I'm not familiar with.

When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:

0: the thing is written in some internal buffer

1: the buffer is advised to be passed to the OS (hint bits?)

2: the buffer is actually passed to the OS (write, flush)

3: the OS is advised to send the written data to the io subsystem
   (sync_file_range with SYNC_FILE_RANGE_WRITE)

4: the OS is required to send the written data to the disk
   (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

It is not clear when reading the text which level is discussed. In particular, I'm not sure that "flush" refers to level 2, which is misleading. When reading the description, I'm rather under the impression that it is about level 4, but then if actual fsyncs were performed every 200 ms the tps would be very low...

After more considerations, my final understanding is that this behavior only occurs with "asynchronous commit", aka a situation when COMMIT does not wait for data to be really fsynced, but the fsync is to occur within some delay so it will not be too far away, some kind of compromise for performance where commits can be lost.

Now all this is somehow alien to me because the whole point of committing is having the data on disk, and I would not consider a database to be safe if commit does not imply fsync, but I understand that people may have to compromise for performance.

Is my understanding right?

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently. The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I previously reviewed 0001 and I think it's fine. I haven't reviewed 0002 in detail, but I like the concept.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd welcome review of those, and I plan to commit them independently of the rest as they're beneficial independently. The most important bits are the comments and docs of 0002 - they weren't particularly good beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
      potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
      flushing every wal_writer_delay ms or wal_writer_flush_after
      bytes.

Greetings,

Andres Freund

>From f3bc3a7c40c21277331689595814b359c55682dc Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 1/6] Allow SetHintBits() to succeed if the buffer's LSN is
 new enough.

Previously we only allowed SetHintBits() to succeed if the commit LSN
of the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not require the commit record to be flushed, it just
requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.

In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits that are set. But more importantly it also keeps
the success rate up when flushing WAL less frequently.
That was the original reason for commit 4de82f7d7, which has negative
performance consequences in a number of scenarios. This will allow a
followup commit to reduce the flush rate.

Discussion: 20160118163908.gw10...@awork2.anarazel.de
---
 src/backend/utils/time/tqual.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 465933d..503bd1d 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -89,12 +89,13 @@ static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
  * It is only safe to set a transaction-committed hint bit if we know the
- * transaction's commit record has been flushed to disk, or if the table is
- * temporary or unlogged and will be obliterated by a crash anyway.  We
- * cannot change the LSN of the page here because we may hold only a share
- * lock on the buffer, so we can't use the LSN to interlock this; we have to
- * just refrain from setting the hint bit until some future re-examination
- * of the tuple.
+ * transaction's commit record is guaranteed to be flushed to disk before the
+ * buffer, or if the table is temporary or unlogged and will be obliterated by
+ * a crash anyway.  We cannot change the LSN of the page here because we may
+ * hold only a share lock on the buffer, so we can only use the LSN to
+ * interlock this if the buffer's LSN already is newer than the commit LSN;
+ * otherwise we have to just refrain from setting the hint bit until some
+ * future re-examination of the tuple.
  *
  * We can always set hint bits when marking a transaction aborted.  (Some
  * code in heapam.c relies on that!)
@@ -122,8 +123,12 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		/* NB: xid must be known committed here! */
 		XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);
 
-		if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
-			return;				/* not flushed yet, so don't set hint */
+		if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
+			BufferGetLSNAtomic(buffer) < commitLSN)
+		{
+			/* not flushed and no LSN interlock, so don't set hint */
+			return;
+		}
 	}
 
 	tuple->t_infomask |= infomask;
-- 
2.7.0.229.g701fa7f

>From e4facce2cf8b982408ff1de174cffc202852adfd Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Thu, 11 Feb 2016 19:34:29 +0100
Subject: [PATCH 2/6] Allow the WAL writer to flush WAL at a reduced rate.

Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint
bits can reduce contention around the clog et al. But unfortunately
the increased flush rate can have a significant negative performance
impact, I have measured up to a factor of ~4. The reason for this
slowdown is that if there are independent writes to the underlying
devices, for example
Re: [HACKERS] checkpointer continuous flushing - V16
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO wrote:
>>> I think I would appreciate comments to understand why/how the
>>> ringbuffer is used, and more comments in general, so it is fine if you
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
> Ok, so the patch would only include the checkpointer stuff.
>
> I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new bgwriter, but it's not really worthwhile until sorting and flushing is in...

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Re: [HACKERS] checkpointer continuous flushing - V16
> > I think I would appreciate comments to understand why/how the ringbuffer
> > is used, and more comments in general, so it is fine if you improve this
> > part.
>
> I'd suggest to leave out the ringbuffer/new bgwriter parts.

Ok, so the patch would only include the checkpointer stuff.

I'll look at this part in detail.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer
> is used, and more comments in general, so it is fine if you improve this
> part.

I'd suggest to leave out the ringbuffer/new bgwriter parts. I think they'd be committed separately, and probably not in 9.6.

Thanks,

Andres
Re: [HACKERS] checkpointer continuous flushing - V16
Hello Andres,

> Any comments before I spend more time polishing this?

I'm running tests on various settings, I'll send a report when it is done. Up to now the performance seems as good as with the previous version.

> I'm currently updating docs and comments to actually describe the current
> state...

I did notice the mismatched documentation.

I think I would appreciate comments to understand why/how the ringbuffer is used, and more comments in general, so it is fine if you improve this part.

Minor details:

"typedefs.list" should be updated to WritebackContext.

"WritebackContext" is a typedef, "struct" is not needed.

I'll look at the code more deeply probably over next weekend.

-- Fabien.
Re: [HACKERS] checkpointer continuous flushing - V16
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>    changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>    tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and backends are flushed, configurable
>    by individual GUCs. Without that I still saw the spikes in a lot of
>    circumstances.
>
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this?

I'm currently updating docs and comments to actually describe the current state...

Andres
Re: [HACKERS] checkpointer continuous flushing - V16
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch series. While this isn't entirely ready for commit, I think we're getting closer.

I don't want to post a full series right now, but my working state is available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and backends are flushed, configurable
   by individual GUCs. Without that I still saw the spikes in a lot of
   circumstances.

There's also a more experimental reimplementation of bgwriter, but I'm not sure it's realistic to polish that up within the constraints of 9.6.

Regards,

Andres