Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be to a single 1GB relation chunk. The odds are better that multiple writes will combine, and that the I/O will involve a lower than average amount of random seeking. Shrinking the size of the write cache, by contrast, always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase. I can't understand why postgres does not
> consider scheduling of fsync phase.

Because it cannot get the sort of latency improvements I think people want. I proved that to myself during the last 9.2 CF, when I submitted several fsync scheduling changes. By the time you get to the sync phase, on a system that's always writing heavily there is way too much backlog to possibly cope with by then. There just isn't enough time left before the checkpoint should end to write everything out. You have to force writes to actual disk to start happening earlier to keep a predictable schedule.

Basically, the longer you go without issuing a fsync, the more uncertainty there is around how long it might take to fire. My proposal lets someone keep all I/O from ever reaching the point where the uncertainty is that high.

In the simplest case to explain, imagine that a checkpoint includes a 1GB relation segment that is completely dirty in shared_buffers. When a checkpoint hits this, it will have 1GB of I/O to push out. If you have waited this long to fsync the segment, the problem is now too big to fix by checkpoint time. Even if the 1GB of writes are themselves nicely ordered and grouped on disk, the concurrent background activity is going to chop the combination up into more random I/O than the ideal. Regular consumer disks have a worst case random I/O throughput of less than 2MB/s.
My observed progress rates for such systems show you're lucky to get 10MB/s of writes out of them. So how long will the dirty 1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*. And that's exactly what I saw whenever I tried to play with checkpoint sync scheduling. No matter what you do there, periodically you'll hit a segment that has over a minute of dirty data accumulated, and >60 second latency pauses result. By the time you've reached the checkpoint, you're already dead when you call fsync on that relation. You *must* hit that segment with fsync more often than once per checkpoint to achieve reasonable latency.

With this "linear slider" idea, I might instead tune things so that no segment will ever get more than 256MB of writes before being hit with a fsync. I can't guarantee that will work usefully, but the shape of the idea seems to match the problem.

> Taken together my checkpoint proposal method,
> * write phase
>   - Almost same, but considering fsync phase schedule.
>   - Considering case of background-write in OS, sort buffer before
>     starting checkpoint write.

This cannot work, for the reasons I've outlined here. I guarantee you I will easily find a test workload where it performs worse than what's happening right now. If you want to play with this to learn more about the trade-offs involved, that's fine, but expect me to vote against accepting any change of this form. I would prefer you not submit them, because it will waste a large amount of reviewer time to reach that conclusion yet again. And I'm not going to be that reviewer.

> * fsync phase
>   - Considering checkpoint schedule and write-phase schedule
>   - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine tune how much fsync is happening at any time, that would be useful on all the platforms that support it.
I haven't tried it, just because that looked to me like a large job of refactoring the entire fsync absorb mechanism, and I've never had enough funding to take it on. That approach has a lot of good properties, though, if it could be made to work without a lot of code changes.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 7:32 AM, Tom Lane wrote:
> Greg Smith writes:
>> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>>> "random" fs pages on one large disk page and having an extra index
>>> layer for resolving random-to-sequential ordering.
>> If your solution to avoiding random writes now is to do sequential ones
>> into a buffer, you'll pay for it by having more expensive random reads
>> later.
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes. But sooner or
> later you have to put the data where it belongs.

FWIW, at RICon East there was someone from Seagate who gave a presentation. One of his points was that even spinning rust is moving to the point where the drive itself has to do some kind of write log. He noted that modern filesystems do the same thing, and the overlap is probably stupid (I pointed out that the most degenerate case is the logging database on the logging filesystem on the logging drive...).

It'd be interesting for Postgres to work with drive manufacturers to study ways to get rid of the extra layers of stupid...

--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell)   http://jim.nasby.net
Re: [HACKERS] Design proposal: fsync absorb linear slider
(2013/07/24 1:13), Greg Smith wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith wrote:
>>> We know that a 1GB relation segment can take a really long time to
>>> write out. That could include up to 128 changed 8K pages, and we allow
>>> all of them to get dirty before any are forced to disk with fsync.
>> By my count, it can include up to 131,072 changed 8K pages.
> Even better! I can pinpoint exactly what time last night I got tired
> enough to start making trivial mistakes. Everywhere I said 128 it's
> actually 131,072, which just changes the range of the GUC I proposed.

I think that it is almost the same as a small dirty_background_ratio or dirty_background_bytes. This method will give very bad performance, and many fsync() calls may cause the long-fsync situation you described in the past. My colleagues who are kernel experts say that while fsync() is executing, if other processes are writing to the same file a lot, the fsync call occasionally does not return. So too many fsyncs on a large file are very dangerous. Moreover, fsync() also writes metadata, which is worst for performance.

The essential improvement is not the dirty page size in fsync() but the scheduling of the fsync phase. I can't understand why postgres does not consider scheduling of the fsync phase. When dirty_background_ratio is big, the write phase does not write to disk at all; therefore, fsync() is too heavy in the fsync phase.

> Getting the number right really highlights just how bad the current
> situation is. Would you expect the database to dump up to 128K writes
> into a file and then have low latency when it's flushed to disk with
> fsync? Of course not.

I think this problem will be improved by using sync_file_range() in the fsync phase, and by adding checkpoint scheduling to the fsync phase: executing small-range sync_file_range() calls and sleeps, with fsync() executed at the end. I think it is better than your proposal.
If a system does not support the sync_file_range() system call, it only executes fsync and sleep, which is the same as the method you and I posted in the past.

Taken together, my checkpoint proposal method:

* write phase
  - Almost the same, but considering the fsync phase schedule.
  - Considering the case of background writes by the OS, sort the buffers
    before starting the checkpoint write.

* fsync phase
  - Considering the checkpoint schedule and write-phase schedule
  - Executing separated sync_file_range() calls and sleeps, with fsync()
    at the end.

And if I can, avoid writing a buffer whose file is currently the target of a fsync(). I think it may be quite difficult.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
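The fsync-phase shape proposed above — sync small ranges with a sleep between each, then one final fsync() — can be sketched roughly as follows. This is an illustrative Python sketch rather than checkpointer code (PostgreSQL itself is C); it reaches Linux's sync_file_range(2) through ctypes and falls back to fdatasync where that call doesn't exist, and the chunk size and pause are invented tuning knobs, not values from the mail:

```python
import ctypes
import ctypes.util
import os
import time

SYNC_FILE_RANGE_WRITE = 2  # from Linux <fcntl.h>: start writeback, don't wait


def _sync_range(fd, offset, nbytes):
    """Ask the kernel to begin writeback of one range of the file.

    Falls back to fdatasync on platforms whose libc lacks
    sync_file_range (which also matches the mail's fallback idea).
    """
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        libc.sync_file_range(ctypes.c_int(fd), ctypes.c_long(offset),
                             ctypes.c_long(nbytes),
                             ctypes.c_uint(SYNC_FILE_RANGE_WRITE))
    except AttributeError:
        os.fdatasync(fd)


def checkpoint_sync(path, chunk=8 * 1024 * 1024, pause=0.01):
    """Sync one segment file in small ranges with sleeps, then fsync.

    The final fsync should find little dirty data left, so its
    latency spike is bounded instead of covering the whole backlog.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, chunk):
            _sync_range(fd, offset, min(chunk, size - offset))
            time.sleep(pause)  # spread the I/O out instead of one burst
        os.fsync(fd)           # final fsync also covers metadata
    finally:
        os.close(fd)
```

The point of the sleeps is the scheduling argument made above: the writeback is metered out over the fsync phase instead of arriving as one storm when fsync() fires.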
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith wrote:
> On 7/26/13 9:14 AM, didier wrote:
>> During recovery you have to load the log in cache first before applying
>> WAL.
>
> Checkpoints exist to bound recovery time after a crash. That is their
> only purpose. What you're suggesting moves a lot of work into the
> recovery path, which will increase how long recovery takes.

Yes, it's slower, but you're sequentially reading only one file, at most the size of your buffer cache; moreover it's a constant time. Let's say you make a checkpoint and crash just after, with a next to empty WAL. Now recovery is very fast, but you have to repopulate your cache with random reads from requests. With the snapshot it's slower, but you read, sequentially again, a lot of hot cache you will need later when the db starts serving requests.

Of course the worst case is if it crashes just before a checkpoint: most of the snapshot data are stale and will be overwritten by WAL ops. But if the WAL recovery is CPU bound, loading from the snapshot may be done concurrently while replaying the WAL.

> More work at recovery time means someone who uses the default of
> checkpoint_timeout='5 minutes', expecting that crash recovery won't take
> very long, will discover it does take a longer time now. They'll be
> forced to shrink the value to get the same recovery time as they do
> currently. You might need to make checkpoint_timeout 3 minutes instead,
> if crash recovery now has all this extra work to deal with. And when the
> time between checkpoints drops, it will slow the fundamental efficiency
> of checkpoint processing down. You will end up writing out more data in
> the end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every time; with the log you're paying only once, at recovery.

> The interval between checkpoints and recovery time are all related.
> If you let any one side of the current requirements slip, it makes the
> rest easier to deal with. Those are all trade-offs though, not
> improvements. And this particular one is already an option.
>
> If you want less checkpoint I/O per capita and don't care about recovery
> time, you don't need a code change to get it. Just make
> checkpoint_timeout huge. A lot of checkpoint I/O issues go away if you
> only do a checkpoint per hour, because instead of random writes you're
> getting sequential ones to the WAL. But when you crash, expect to be
> down for a significant chunk of an hour, as you go back to sort out all
> of the work postponed before.

It's not the same: it's a snapshot saved and loaded in constant time, unlike the WAL log.

Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 8:32 AM, Tom Lane wrote:
> What I'd point out is that that is exactly what WAL does for us, ie
> convert a bunch of random writes into sequential writes. But sooner or
> later you have to put the data where it belongs.

Hannu was observing that SSD often doesn't do that at all. They can maintain logical -> physical translation tables that decode where each block was written to, forever. When read seeks are really inexpensive, the only pressure to reorder blocks is wear leveling.

That doesn't really help with regular drives though, where the low seek time assumption doesn't play out so well. The whole idea of writing things sequentially and then sorting them out later was all the rage in 2001 for ext3 on Linux, as part of the "data=journal" mount option. You can go back and see people confused but excited about the performance at http://www.ibm.com/developerworks/linux/library/l-fs8/index.html

Spoiler: if you use a workload that has checkpoint issues, it doesn't help PostgreSQL latency. Just like using a large write cache, you gain some burst performance, but eventually you pay for it with extra latency somewhere.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 9:14 AM, didier wrote:
> During recovery you have to load the log in cache first before applying
> WAL.

Checkpoints exist to bound recovery time after a crash. That is their only purpose. What you're suggesting moves a lot of work into the recovery path, which will increase how long recovery takes.

More work at recovery time means someone who uses the default of checkpoint_timeout='5 minutes', expecting that crash recovery won't take very long, will discover it does take a longer time now. They'll be forced to shrink the value to get the same recovery time as they do currently. You might need to make checkpoint_timeout 3 minutes instead, if crash recovery now has all this extra work to deal with. And when the time between checkpoints drops, it will slow the fundamental efficiency of checkpoint processing down. You will end up writing out more data in the end.

The interval between checkpoints and recovery time are all related. If you let any one side of the current requirements slip, it makes the rest easier to deal with. Those are all trade-offs though, not improvements. And this particular one is already an option.

If you want less checkpoint I/O per capita and don't care about recovery time, you don't need a code change to get it. Just make checkpoint_timeout huge. A lot of checkpoint I/O issues go away if you only do a checkpoint per hour, because instead of random writes you're getting sequential ones to the WAL. But when you crash, expect to be down for a significant chunk of an hour, as you go back to sort out all of the work postponed before.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed, but why isn't postgresql writing its
>> cache sequentially to a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data. You'll make every later reader pay the
> random I/O penalty that's being avoided right now. Checkpoints are
> already postponing these random writes as long as possible. You have to
> take care of them eventually though.

No, the log file is only used at recovery time.

In the checkpoint code:
- loop over the cache, marking dirty buffers with BM_CHECKPOINT_NEEDED,
  as in the current code
- other workers can't write or evict these marked buffers to disk;
  there's a race with fsync
- checkpoint fsyncs now, or after the next step
- checkpoint loops again, saving these buffers to the log; it clears
  BM_CHECKPOINT_NEEDED but *doesn't* clear BM_DIRTY; of course many
  buffers will be written again, as they are when a checkpoint isn't
  running
- checkpoint done

During recovery you have to load the log in cache first, before applying WAL.

Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
Greg Smith writes:
> On 7/26/13 5:59 AM, Hannu Krosing wrote:
>> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
>> "random" fs pages on one large disk page and having an extra index
>> layer for resolving random-to-sequential ordering.

> If your solution to avoiding random writes now is to do sequential ones
> into a buffer, you'll pay for it by having more expensive random reads
> later.

What I'd point out is that that is exactly what WAL does for us, ie convert a bunch of random writes into sequential writes. But sooner or later you have to put the data where it belongs.

regards, tom lane
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/26/13 5:59 AM, Hannu Krosing wrote:
> Well, SSD disks do it in the way proposed by didier (AFAIK), by putting
> "random" fs pages on one large disk page and having an extra index layer
> for resolving random-to-sequential ordering.

If your solution to avoiding random writes now is to do sequential ones into a buffer, you'll pay for it by having more expensive random reads later. In the SSD write buffer case, that works only because those random reads are very cheap. Do the same thing on a regular drive, and you'll be paying a painful penalty *every* time you read, in return for saving work *once* when you write. That only makes sense when your workload is near write-only.

It's possible to improve on this situation by having some sort of background process that goes back and cleans up the random data, converting it back into sequentially ordered writes again. SSD controllers also have firmware that does this sort of work, and Postgres might do it as part of vacuum cleanup. But note that such work faces exactly the same problems as writing the data out in the first place.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
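The translation-table trade-off under discussion can be shown with a toy log-structured store: every logical write is appended sequentially, and an index maps each logical page to its latest physical slot. This is a deliberately minimal sketch of the general idea, not how any particular SSD firmware or filesystem implements it:

```python
class LogStructuredStore:
    """Toy flash-translation-layer model: random logical writes become
    sequential appends, at the cost of an extra indirection on reads."""

    def __init__(self):
        self.log = []       # the "physical medium": append-only
        self.mapping = {}   # logical page number -> index into self.log

    def write(self, logical_page, data):
        # Remap the logical page to the newest copy; never seek, only append.
        self.mapping[logical_page] = len(self.log)
        self.log.append(data)

    def read(self, logical_page):
        # Reads pay the indirection: look up where the page really lives.
        return self.log[self.mapping[logical_page]]


store = LogStructuredStore()
store.write(42, b"old")
store.write(7, b"other")
store.write(42, b"new")   # overwrite appends; the old copy becomes garbage
assert store.read(42) == b"new"
```

Note that the overwrite leaves a stale copy behind in the log: that garbage is exactly what the background cleanup / wear-leveling process described above has to reclaim, and on spinning media the indirection on reads is the "more expensive random reads later" being paid for.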
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed, but why isn't postgresql writing its
>> cache sequentially to a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data.

In case of crash recovery, a sequential read of this file could be performed as a first step. This should work fairly well in most cases, at least when the recovery shared_buffers is not smaller than the latest run of checkpoint-written dirty buffers.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 07/26/2013 11:42 AM, Greg Smith wrote:
> On 7/25/13 6:02 PM, didier wrote:
>> It was surely already discussed, but why isn't postgresql writing its
>> cache sequentially to a temporary file?
>
> If you do that, reads of the data will have to traverse that temporary
> file to assemble their data. You'll make every later reader pay the
> random I/O penalty that's being avoided right now. Checkpoints are
> already postponing these random writes as long as possible. You have
> to take care of them eventually though.

Well, SSD disks do it in the way proposed by didier (AFAIK), by putting "random" fs pages on one large disk page and having an extra index layer for resolving random-to-sequential ordering.

I would not dismiss the idea without more tests and discussion. We could have a system where "checkpoint" does sequential writes of dirty wal buffers to alternating synced holding files (a "checkpoint log" :) ) and only the background writer does random writes, with no forced sync at all.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/25/13 6:02 PM, didier wrote:
> It was surely already discussed, but why isn't postgresql writing its
> cache sequentially to a temporary file?

If you do that, reads of the data will have to traverse that temporary file to assemble their data. You'll make every later reader pay the random I/O penalty that's being avoided right now. Checkpoints are already postponing these random writes as long as possible. You have to take care of them eventually though.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

> Sure, that's what the WAL does. But you still have to checkpoint
> eventually.

Sure: when you run pg_ctl stop. Unlike the WAL, it only needs two files of shared_buffers size.

I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1; in BufferSync(), and simulating the checkpoint with a periodic dd if=/dev/zero of=foo conv=fsync

On saturated storage with %usage locked solid at 100%, I got up to a 30% speed improvement and fsync latency down by one order of magnitude; some fsyncs were still slow, of course, if the buffers were already in the OS cache.

But that's the upper bound: it was done on slow storage with bad ratios, (OS cache write)/(disk sequential write) around 50 and (sequential write)/(effective random write) in the 10 range, and a proper implementation would have a 'little' more work to do... (only the checkpoint task can write BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
On Thu, Jul 25, 2013 at 6:02 PM, didier wrote:
> It was surely already discussed, but why isn't postgresql writing its
> cache sequentially to a temporary file? With storage random speed at
> least five to ten times slower, it could help a lot.

Sure, that's what the WAL does. But you still have to checkpoint eventually.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith wrote:
> Recently I've been dismissing a lot of suggested changes to checkpoint
> fsync timing without suggesting an alternative. I have a simple one in
> mind that captures the biggest problem I see: that the number of backend
> and checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write
> out. That could include up to 128 changed 8K pages, and we allow all of
> them to get dirty before any are forced to disk with fsync.

It was surely already discussed, but why isn't postgresql writing its cache sequentially to a temporary file? With storage random speed at least five to ten times slower, it could help a lot.

Thanks
Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
On Tue, Jul 23, 2013 at 12:13 PM, Greg Smith wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith wrote:
>>> We know that a 1GB relation segment can take a really long time to
>>> write out. That could include up to 128 changed 8K pages, and we allow
>>> all of them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better! I can pinpoint exactly what time last night I got tired
> enough to start making trivial mistakes. Everywhere I said 128 it's
> actually 131,072, which just changes the range of the GUC I proposed.
>
> Getting the number right really highlights just how bad the current
> situation is. Would you expect the database to dump up to 128K writes
> into a file and then have low latency when it's flushed to disk with
> fsync? Of course not. But that's the job the checkpointer process is
> trying to do right now. And it's doing it blind--it has no idea how many
> dirty pages might have accumulated before it started.
>
> I'm not exactly sure how best to use the information collected. fsync
> every N writes is one approach. Another is to use accumulated writes to
> predict how long fsync on that relation should take. Whenever I tried to
> spread fsync calls out before, the scale of the piled up writes from
> backends was the input I really wanted available. The segment write
> count gives an alternate way to sort the blocks too, you might start
> with the heaviest hit ones.
>
> In all these cases, the fundamental I keep coming back to is wanting to
> cue off past write statistics. If you want to predict relative I/O delay
> times with any hope of accuracy, you have to start the checkpoint
> knowing something about the backend and background writer activity since
> the last one.

So, I don't think this is a bad idea; in fact, I think it'd be a good thing to explore.
The hard part is likely to be convincing ourselves of anything about how well or poorly it works on arbitrary hardware under arbitrary workloads, but we've got to keep trying things until we find something that works well, so why not this?

One general observation is that there are two bad things that happen when we checkpoint. One is that we force all of the data in RAM out to disk, and the other is that we start doing lots of FPIs. Both of these things harm throughput. Your proposal allows the user to make the first of those behaviors more frequent without making the second one more frequent. That idea seems promising, and it also seems to admit of many variations.

For example, instead of issuing an fsync after N OS writes to a particular file, we could fsync the file with the most writes every K seconds. That way, if the system has busy and idle periods, we'll effectively "catch up on our fsyncs" when the system isn't that busy, and we won't bunch them up too much if there's a sudden surge of activity. Now that's just a shot in the dark and there might be reasons why it's terrible, but I offer it generally as food for thought: the triggering event for the extra fsyncs could be chosen via a multitude of different algorithms, and as you hack through this it might be worth trying a few different possibilities.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
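The "fsync the most-written file every K seconds" variation above might look roughly like this. The class and names here are invented for illustration (the real absorption machinery lives in PostgreSQL's C checkpointer); it just demonstrates the triggering policy being proposed, with the timestamp injectable so the policy is testable:

```python
import os
import time
from collections import Counter


class AbsorbSketch:
    """Sketch of 'every K seconds, fsync the segment with the most
    absorbed writes' -- one candidate triggering policy, not a claim
    about how PostgreSQL actually behaves."""

    def __init__(self, interval_s=1.0):
        self.writes = Counter()   # segment path -> absorbed write count
        self.interval_s = interval_s
        self.last_sync = time.monotonic()

    def note_write(self, path):
        """A backend wrote this segment; remember the backlog grew."""
        self.writes[path] += 1

    def maybe_sync(self, now=None):
        """If K seconds elapsed, flush the busiest segment and forget
        its backlog. Returns the synced path, or None."""
        now = time.monotonic() if now is None else now
        if not self.writes or now - self.last_sync < self.interval_s:
            return None
        path, _count = self.writes.most_common(1)[0]
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)          # pay for this segment's backlog now
        finally:
            os.close(fd)
        del self.writes[path]     # its backlog is cleared
        self.last_sync = now
        return path
```

During idle periods maybe_sync keeps firing and drains the counters one segment at a time, which is the "catch up on our fsyncs" behavior described above; during a surge, the interval caps how often the extra fsyncs are issued.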
Re: [HACKERS] Design proposal: fsync absorb linear slider
On 7/23/13 10:56 AM, Robert Haas wrote:
> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith wrote:
>> We know that a 1GB relation segment can take a really long time to
>> write out. That could include up to 128 changed 8K pages, and we allow
>> all of them to get dirty before any are forced to disk with fsync.
>
> By my count, it can include up to 131,072 changed 8K pages.

Even better! I can pinpoint exactly what time last night I got tired enough to start making trivial mistakes. Everywhere I said 128 it's actually 131,072, which just changes the range of the GUC I proposed.

Getting the number right really highlights just how bad the current situation is. Would you expect the database to dump up to 128K writes into a file and then have low latency when it's flushed to disk with fsync? Of course not. But that's the job the checkpointer process is trying to do right now. And it's doing it blind--it has no idea how many dirty pages might have accumulated before it started.

I'm not exactly sure how best to use the information collected. fsync every N writes is one approach. Another is to use accumulated writes to predict how long fsync on that relation should take. Whenever I tried to spread fsync calls out before, the scale of the piled up writes from backends was the input I really wanted available. The segment write count gives an alternate way to sort the blocks too; you might start with the heaviest hit ones.

In all these cases, the fundamental I keep coming back to is wanting to cue off past write statistics. If you want to predict relative I/O delay times with any hope of accuracy, you have to start the checkpoint knowing something about the backend and background writer activity since the last one.
--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
Re: [HACKERS] Design proposal: fsync absorb linear slider
On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith wrote:
> Recently I've been dismissing a lot of suggested changes to checkpoint
> fsync timing without suggesting an alternative. I have a simple one in
> mind that captures the biggest problem I see: that the number of backend
> and checkpoint writes to a file are not connected at all.
>
> We know that a 1GB relation segment can take a really long time to write
> out. That could include up to 128 changed 8K pages, and we allow all of
> them to get dirty before any are forced to disk with fsync.

By my count, it can include up to 131,072 changed 8K pages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Design proposal: fsync absorb linear slider
On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith wrote:
> And I can't get too excited about making this my volunteer effort when I
> consider what the resulting credit will look like. Coding is by far the
> smallest part of work like this, first behind coming up with the design
> in the first place. And both of those are way, way behind how long
> review benchmarking takes on something like this. The way credit is
> distributed for this sort of feature puts coding first, design not
> credited at all, and maybe you'll see some small review credit for
> benchmarks. That's completely backwards from the actual work ratio. If
> all I'm getting out of something is credit, I'd at least like it to be
> an appropriate amount of it.

FWIW, I think that's a reasonable request.

--
Peter Geoghegan
[HACKERS] Design proposal: fsync absorb linear slider
Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and checkpoint writes to a file are not connected at all.

We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.

Rather than second guess the I/O scheduling, I'd like to take this on directly, by recognizing that the size of the problem is proportional to the number of writes to a segment. If you turned off fsync absorption altogether, you'd be at an extreme that allows only 1 write before fsync. That's low latency for each write, but terrible throughput. The maximum throughput case of 128 writes has the terrible latency we get reports about. But what if that trade-off was just a straight, linear slider going from 1 to 128? Just move it to the latency vs. throughput position you want, and see how that works out.

The implementation I had in mind was this one:

- Add an absorption_count to the fsync queue.
- Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its
  default value is -1 (or 0), which corresponds to ignoring this new
  behavior.
- Whenever the background writer absorbs a fsync call for a relation
  that's already in the queue, increment the absorption count.
- When max_segment_absorb > 0, have the background writer scan for
  relations where absorption_count > max_segment_absorb. When it finds
  one, call fsync on that segment.

Note that it's possible for this simple scheme to be fooled when writes are actually touching a small number of pages. A process that constantly overwrites the same page is the worst case here. Overwrite it 128 times, and this method would assume you've dirtied every page, while only 1 will actually go to disk when you call fsync. It's possible to track this better.
The count mechanism could be replaced with a bitmap of the 128 blocks, so that absorbs set a bit instead of incrementing a count. My gut feel is that this is more complexity than is really necessary here. If in fact the fsync is slimmer than expected, paying for it too often isn't the worst problem to have here.

I'd like to build this myself, but if someone else wants to take a shot at it I won't mind. Just be aware the review is the big part here. I should be honest about one thing: I have zero incentive to actually work on this. The moderate amount of sponsorship money I've raised for 9.4 so far isn't getting anywhere near this work. The checkpoint patch review I have been doing recently is coming out of my weekend volunteer time.

And I can't get too excited about making this my volunteer effort when I consider what the resulting credit will look like. Coding is by far the smallest part of work like this, first behind coming up with the design in the first place. And both of those are way, way behind how long review benchmarking takes on something like this. The way credit is distributed for this sort of feature puts coding first, design not credited at all, and maybe you'll see some small review credit for benchmarks. That's completely backwards from the actual work ratio. If all I'm getting out of something is credit, I'd at least like it to be an appropriate amount of it.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
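The moving parts of the proposal above can be sketched as follows. The name max_segment_absorb follows the mail; everything else (the queue shape, the use of file paths as keys) is simplified for illustration, since the real fsync queue lives in PostgreSQL's C shared-memory machinery:

```python
import os

# Proposed GUC from the mail: -1 (or 0) disables the new behavior.
max_segment_absorb = 16


class FsyncQueueSketch:
    """Sketch of the 'linear slider': count absorbed fsync requests per
    segment, and force an early fsync once a segment exceeds the cap,
    instead of letting the backlog pile up until checkpoint time."""

    def __init__(self):
        self.absorption_count = {}   # segment path -> absorbed writes

    def absorb(self, path):
        """A backend wrote a dirty page; the fsync request is absorbed
        into the queue, and the segment's counter grows."""
        self.absorption_count[path] = self.absorption_count.get(path, 0) + 1

    def background_scan(self):
        """Background-writer pass: fsync any segment over the threshold,
        so no segment ever accumulates an unbounded backlog."""
        if max_segment_absorb <= 0:
            return []                # slider disabled: current behavior
        synced = []
        for path, count in list(self.absorption_count.items()):
            if count > max_segment_absorb:
                fd = os.open(path, os.O_RDWR)
                try:
                    os.fsync(fd)     # pay for this backlog before checkpoint
                finally:
                    os.close(fd)
                self.absorption_count[path] = 0
                synced.append(path)
        return synced
```

The caveat from the mail shows up directly here: absorb() counts requests, not distinct pages, so a process rewriting one page 128 times looks as dirty as one touching 128 pages. Replacing the counter with a per-block bitmap would fix that, at the cost of the extra complexity discussed above.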