[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera wrote:
>> Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff.

Sorry, I'm just catching up on my mail from FOSDEM this past weekend. I had come to the same conclusion as Greg that I might as well just commit it with Tom's pg_flush_data() name, and we can decide later, if and when we have pg_fsync_start()/pg_fsync_finish(), whether it's worth keeping two APIs or not.

So I was just going to commit it like that, but I discovered last week that I don't have CVS write access set up yet. I'll commit it as soon as I generate a new SSH key and Dave installs it, etc. I intentionally picked a small, simple patch that nobody was waiting on because I knew there was a risk of delays like this and the paperwork. I'm nearly there.

--
greg

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Robert Haas wrote:
> Well it seems that what we're trying to implement is more like it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(), so maybe that would work.
>
> Anyway, is there something that we can agree on and get committed here for 9.0, or should we postpone this to 9.1? It seems simple enough that we ought to be able to get it done, but we're running out of time and we don't seem to have a clear vision here yet...

This is turning into yet another one of those situations where something simple and useful is being killed by trying to generalize it way more than it needs to be, given its current goals and its lack of external interfaces. There's no catversion bump or API breakage to hinder future refactoring if this isn't optimally designed internally from day one.

The feature is valuable, and there seems to be at least one spot where it may be resolving the possibility of a subtle OS interaction bug by being more thorough in the way that it writes and syncs. The main contention seems to be over naming and completely optional additional abstraction. I consider the whole "let's make this cover every type of complicated sync on every platform" goal interesting and worthwhile, but it's completely optional for this release. The stuff being fretted over now is ultimately an internal interface that can be refactored at will in later releases with no user impact.

If the goal here could be shifted back to finding the minimal level of abstraction that doesn't seem completely wrong, then updating the function names and comments to match that more closely, this could return to committable. That's all I thought was left to do when I moved it to "ready for committer", and as far as I've seen this expanded scope of discussion has just moved backwards from that point.

--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Greg Smith g...@2ndquadrant.com writes:
>> This is turning into yet another one of those situations where something simple and useful is being killed by trying to generalize it way more than it needs to be, given its current goals and its lack of external interfaces. There's no catversion bump or API breakage to hinder future refactoring if this isn't optimally designed internally from day one.
>
> I agree that it's too late in the cycle for any major redesign of the patch. But is it too much to ask to use a less confusing name for the function?

+1. Let's just rename the thing, add some comments, and call it good.

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres Freund wrote:
> I personally think the fsync on the directory should be added to the stable branches - other opinions? If wanted I can prepare patches for that.

Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff.

--
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
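For readers unfamiliar with the directory fsync discussed above, here is a minimal Python sketch of the idea (not the patch's C code). After a file is written and fsynced, its containing directory is fsynced as well so that the new directory entry is durable. The name `fsync_fname` is borrowed from the thread but re-implemented here purely for illustration, `write_durably` is a hypothetical helper, and tolerating EINVAL/EBADF on directories is an assumption about filesystems that reject a directory fsync. It assumes a POSIX system where a directory can be opened read-only.

```python
import errno
import os
import tempfile


def fsync_fname(path: str, is_dir: bool = False) -> None:
    """Open `path` and fsync it. Illustrative only; not PostgreSQL's implementation."""
    # Directories must be opened read-only; regular files are opened read-write.
    fd = os.open(path, os.O_RDONLY if is_dir else os.O_RDWR)
    try:
        os.fsync(fd)
    except OSError as exc:
        # Assumption: some filesystems refuse fsync on a directory; ignore only that case.
        if not (is_dir and exc.errno in (errno.EINVAL, errno.EBADF)):
            raise
    finally:
        os.close(fd)


def write_durably(dirpath: str, name: str, data: bytes) -> str:
    """Write a file, fsync it, then fsync its directory (hypothetical helper)."""
    path = os.path.join(dirpath, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # push the file's data to stable storage
    fsync_fname(dirpath, is_dir=True)  # make the directory entry durable too
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_durably(d, "example.dat", b"hello"))
```

Without the final directory fsync, the file's contents may be on disk while the entry that names it is not, which is the gap the first of the two patches appears to address.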
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera alvhe...@commandprompt.com wrote:
> Andres Freund wrote:
>> I personally think the fsync on the directory should be added to the stable branches - other opinions? If wanted I can prepare patches for that.
>
> Yeah, it seems there are two patches here -- one is the addition of fsync_fname() and the other is the fsync_prepare stuff.

Andres, you want to take a crack at splitting this up?

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Sat, Feb 6, 2010 at 7:03 AM, Andres Freund and...@anarazel.de wrote:
> On Saturday 06 February 2010 06:03:30 Greg Smith wrote:
>> Andres Freund wrote:
>>> On 02/03/10 14:42, Robert Haas wrote:
>>>> Well, maybe we should start with a discussion of what kernel calls you're aware of on different platforms and then we could try to put an API around it.
>>>
>>> In linux there is sync_file_range. On newer Posixish systems one can emulate that with mmap() and msync() (in batches obviously). No idea about windows.
>>
>> The effective_io_concurrency feature had proof of concept test programs that worked using AIO, but actually following through on that implementation would require a major restructuring of how the database interacts with the OS in terms of reads and writes of blocks. It looks to me like doing something similar to sync_file_range on Windows would be similarly difficult.
>
> Looking around a bit, it seems one could achieve something approximately similar to pg_prepare_fsync() by using
>
>   CreateFileMapping
>   MapViewOfFile
>   FlushViewOfFile
>
> If I understand it correctly, that will flush but not wait. Unfortunately you can't even make it wait, so it's not possible to implement sync_file_range or similar fully.

Well it seems that what we're trying to implement is more like it_would_be_nice_if_you_would_start_syncing_this_file_range_but_its_ok_if_you_dont(), so maybe that would work.

Anyway, is there something that we can agree on and get committed here for 9.0, or should we postpone this to 9.1? It seems simple enough that we ought to be able to get it done, but we're running out of time and we don't seem to have a clear vision here yet...

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Andres Freund wrote:
> On 02/03/10 14:42, Robert Haas wrote:
>> Well, maybe we should start with a discussion of what kernel calls you're aware of on different platforms and then we could try to put an API around it.
>
> In linux there is sync_file_range. On newer Posixish systems one can emulate that with mmap() and msync() (in batches obviously). No idea about windows.

There's a series of parameters you can pass into CreateFile: http://msdn.microsoft.com/en-us/library/aa363858(VS.85).aspx

A lot of these are already mapped inside of src/port/open.c in a pretty straightforward way from the POSIX-oriented interface:

  O_RDWR, O_WRONLY -> GENERIC_WRITE, GENERIC_READ
  O_RANDOM         -> FILE_FLAG_RANDOM_ACCESS
  O_SEQUENTIAL     -> FILE_FLAG_SEQUENTIAL_SCAN
  O_SHORT_LIVED    -> FILE_ATTRIBUTE_TEMPORARY
  O_TEMPORARY      -> FILE_FLAG_DELETE_ON_CLOSE
  O_DIRECT         -> FILE_FLAG_NO_BUFFERING
  O_DSYNC          -> FILE_FLAG_WRITE_THROUGH

You have to read the whole "Caching Behavior" section to see exactly how all of those interact, and even then notes like http://support.microsoft.com/kb/99794 are needed to follow the fine points of things like FILE_FLAG_NO_BUFFERING vs. FILE_FLAG_WRITE_THROUGH.

So anything that's setting those POSIX open flags better than before is getting the benefit of that improvement on Windows, too. But that's not quite the same as the changes using fadvise to provide better targeted cache control hints.

I'm getting the impression that doing much better on Windows might fall into the same sort of category as Solaris, where the primary interface for this sort of thing is to use an AIO implementation instead: http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx

The effective_io_concurrency feature had proof of concept test programs that worked using AIO, but actually following through on that implementation would require a major restructuring of how the database interacts with the OS in terms of reads and writes of blocks. It looks to me like doing something similar to sync_file_range on Windows would be similarly difficult.

--
Greg Smith    2ndQuadrant US    Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.us
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
> I think you're probably right, but it's not clear what the new name should be until we have a comment explaining what the function is responsible for.

So I wrote some comments but wasn't going to repost the patch with the unchanged name without explanation... But I think you're right, though I was looking at it the other way around. I want to have an API for a two-stage sync, and of course if I do that I'll comment it to explain that clearly.

The gist of the comments was that the function is preparing to fsync, to initiate the I/O early and allow the later fsync to be fast -- but also at the same time to have the beneficial side effect of avoiding cache poisoning. It's not clear that the two are necessarily linked, though. Perhaps we need two separate APIs, though it'll be hard to keep them separate on all platforms.

--
greg
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On 02/03/10 12:53, Greg Stark wrote:
> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
>> I think you're probably right, but it's not clear what the new name should be until we have a comment explaining what the function is responsible for.
>
> So I wrote some comments but wasn't going to repost the patch with the unchanged name without explanation... But I think you're right, though I was looking at it the other way around. I want to have an API for a two-stage sync, and of course if I do that I'll comment it to explain that clearly.
>
> The gist of the comments was that the function is preparing to fsync, to initiate the I/O early and allow the later fsync to be fast -- but also at the same time to have the beneficial side effect of avoiding cache poisoning. It's not clear that the two are necessarily linked, though. Perhaps we need two separate APIs, though it'll be hard to keep them separate on all platforms.

I vote for two separate APIs - sure, there will be some unfortunate overlap for most unixoid platforms, but it's surely better to be able to add more platforms later in one centralized place than to have to analyze every place where the API is used.

Andres
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark gsst...@mit.edu wrote:
> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
>> I think you're probably right, but it's not clear what the new name should be until we have a comment explaining what the function is responsible for.
>
> So I wrote some comments but wasn't going to repost the patch with the unchanged name without explanation... But I think you're right, though I was looking at it the other way around. I want to have an API for a two-stage sync, and of course if I do that I'll comment it to explain that clearly.
>
> The gist of the comments was that the function is preparing to fsync, to initiate the I/O early and allow the later fsync to be fast -- but also at the same time to have the beneficial side effect of avoiding cache poisoning. It's not clear that the two are necessarily linked, though. Perhaps we need two separate APIs, though it'll be hard to keep them separate on all platforms.

Well, maybe we should start with a discussion of what kernel calls you're aware of on different platforms and then we could try to put an API around it. I mean, right now all you've got is POSIX_FADV_DONTNEED, so given just that I feel like the API could simply be pg_dontneed() or something. It's hard to design a general framework based on one example.

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On 02/03/10 14:42, Robert Haas wrote:
> On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark gsst...@mit.edu wrote:
>> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas robertmh...@gmail.com wrote:
>>> I think you're probably right, but it's not clear what the new name should be until we have a comment explaining what the function is responsible for.
>>
>> So I wrote some comments but wasn't going to repost the patch with the unchanged name without explanation... But I think you're right, though I was looking at it the other way around. I want to have an API for a two-stage sync, and of course if I do that I'll comment it to explain that clearly.
>>
>> The gist of the comments was that the function is preparing to fsync, to initiate the I/O early and allow the later fsync to be fast -- but also at the same time to have the beneficial side effect of avoiding cache poisoning. It's not clear that the two are necessarily linked, though. Perhaps we need two separate APIs, though it'll be hard to keep them separate on all platforms.
>
> Well, maybe we should start with a discussion of what kernel calls you're aware of on different platforms and then we could try to put an API around it.

In linux there is sync_file_range. On newer Posixish systems one can emulate that with mmap() and msync() (in batches obviously). No idea about windows.

Andres
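To illustrate the "mmap() and msync() in batches" suggestion above, here is a rough Python sketch of just the batching structure. It is not what the patch does. `flush_in_batches` is a hypothetical name, and one caveat is important: Python's `mmap.flush()` waits for the write to complete, so unlike the non-waiting behaviour the thread is after, this version blocks on every batch.

```python
import mmap
import os
import tempfile


def flush_in_batches(path: str, batch_bytes: int = 1 << 20) -> int:
    """Map a file one batch at a time and msync each batch; returns the batch count."""
    gran = mmap.ALLOCATIONGRANULARITY
    # Mapping offsets must be multiples of the granularity, so round the batch size up.
    batch = max(gran, (batch_bytes + gran - 1) // gran * gran)
    size = os.path.getsize(path)
    batches = 0
    fd = os.open(path, os.O_RDWR)
    try:
        offset = 0
        while offset < size:
            length = min(batch, size - offset)
            with mmap.mmap(fd, length, access=mmap.ACCESS_WRITE, offset=offset) as m:
                m.flush()  # msync for this window only (blocking in Python)
            batches += 1
            offset += length
    finally:
        os.close(fd)
    return batches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "data.bin")
        with open(p, "wb") as f:
            f.write(b"x" * (3 * mmap.ALLOCATIONGRANULARITY + 10))
        print(flush_in_batches(p, mmap.ALLOCATIONGRANULARITY), "batches flushed")
```

Working in bounded windows keeps the address-space cost of the mapping small for large relation files, which is presumably why batching was called obvious.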
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark gsst...@mit.edu wrote:
> On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> That function *seriously* needs documentation, in particular the fact that it's a no-op on machines without the right kernel call. The name you've chosen is very bad for those semantics. I'd pick something else myself. Maybe pg_start_data_flush or something like that?
>
> I would like to make one token argument in favour of the name I picked. If it doesn't convince I'll change it, since we can always revisit the API down the road. I envision having two function calls, pg_fsync_start() and pg_fsync_finish(). The latter will wait until the data synced in the first call is actually synced. The fall-back if there's no implementation of this would be for fsync_start() to be a no-op (or something unreliable like posix_fadvise) and fsync_finish() to just be a regular fsync.
>
> I think we can accomplish this with sync_file_range() but I need to read up on how it actually works a bit more. In this case it doesn't make a difference, since when we call fsync_finish() it's going to be for the entire file and nothing else will have been writing to these files. But for WAL writing and checkpointing it might have very different performance characteristics.
>
> The big objection to this is that then we don't really have an API for POSIX_FADV_DONTNEED, which is more about cache policy than about syncing to disk. So for example a sequential scan might want to indicate that it isn't planning on reading the buffers it's churning through but doesn't want to force them to be written sooner than otherwise and is never going to call fsync_finish().

I took a look at this patch today and I agree with Tom that pg_fsync_start() is a very confusing name. I don't know what the right name is, but this doesn't fsync so I don't think it should have fsync in the name. Maybe something like pg_advise_abandon() or pg_abandon_cache(). The current name is really wishful thinking: you're hoping that it will make the kernel start the fsync, but it might not. I think pg_start_data_flush() is similarly optimistic.

...Robert
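The two-call design described in the quoted message can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the patch's C code. `fsync_start`, `fsync_finish` and `write_files_two_stage` are stand-in names, and the first stage uses `posix_fadvise` with `POSIX_FADV_DONTNEED` as the "unreliable" hint, falling back to a no-op where it is unavailable.

```python
import os
import tempfile


def fsync_start(fd: int) -> bool:
    """Hint that fd's cached pages are no longer needed; the kernel may start writeback.

    Returns False when no hint could be issued (the no-op fall-back).
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    try:
        # offset=0 and length=0 together mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False  # advisory only, so a failure is not fatal
    return True


def fsync_finish(fd: int) -> None:
    """Block until the file's data is on stable storage (a plain fsync)."""
    os.fsync(fd)


def write_files_two_stage(dirpath: str, files: dict) -> list:
    """Write every file with an early hint, then fsync them all in a second pass."""
    paths = []
    for name, data in files.items():
        path = os.path.join(dirpath, name)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            fsync_start(f.fileno())   # stage 1: hint only, never waits
        paths.append(path)
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            fsync_finish(fd)          # stage 2: the actual durability guarantee
        finally:
            os.close(fd)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_files_two_stage(d, {"a": b"1", "b": b"22"}))
```

Because stage 1 may do nothing at all, correctness rests entirely on stage 2. That is the substance of the naming objection: the first call promises less than "fsync" suggests.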
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tue, Feb 2, 2010 at 12:50 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Andres Freund and...@anarazel.de writes:
>> On Tuesday 02 February 2010 18:36:12 Robert Haas wrote:
>>> I took a look at this patch today and I agree with Tom that pg_fsync_start() is a very confusing name. I don't know what the right name is, but this doesn't fsync so I don't think it should have fsync in the name. Maybe something like pg_advise_abandon() or pg_abandon_cache(). The current name is really wishful thinking: you're hoping that it will make the kernel start the fsync, but it might not. I think pg_start_data_flush() is similarly optimistic.
>>
>> What about: pg_fsync_prepare().
>
> prepare_for_fsync()?

It still seems mis-descriptive to me. Couldn't the same routine be used simply to abandon undirtied data that we no longer care about caching?

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tue, Feb 2, 2010 at 1:34 PM, Andres Freund and...@anarazel.de wrote:
> For now it could - but it very well might be converted to sync_file_range or similar, which would have different side effects. As the potential code duplication is rather small, I would prefer to describe the prime effect, not the side effects...

Hmm, in that case, I think the problem is that this function has no comment explaining its intended charter.

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
On Tue, Feb 2, 2010 at 2:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> Hmm, in that case, I think the problem is that this function has no comment explaining its intended charter.
>
> That's certainly a big problem, but a comment won't fix the fact that the name is misleading. We need both a comment and a name change.

I think you're probably right, but it's not clear what the new name should be until we have a comment explaining what the function is responsible for.

...Robert
[PERFORM] Re: [HACKERS] Re: Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb)
Greg Stark wrote:
> Actually before we get there could someone who demonstrated the speedup verify that this patch still gets that same speedup?

Let's step back a second and get to the bottom of why some people are seeing this and others aren't. The original report here suggested this was an ext4 issue. As I pointed out recently on the performance list, the reason for that is likely that the working write-barrier support for ext4 means it's passing through the fsync to lying hard drives via a proper cache flush, which didn't happen on your typical ext3 install. Given that, I'd expect I could see the same issue with ext3 given a drive with its write cache turned off, so that's the theory I started trying to prove before seeing the patch operate.

What I did was create a little test program that created 5 databases and then dropped them:

  \timing
  create database a;
  create database b;
  create database c;
  create database d;
  create database e;
  drop database a;
  drop database b;
  drop database c;
  drop database d;
  drop database e;

(All of the drop times were very close by the way; around 100ms, nothing particularly interesting there.)

If I have my system's boot drive (attached to the motherboard, not on the caching controller) in its regular, lying mode with write cache on, the creates take the following times:

  Time: 713.982 ms
  Time: 659.890 ms
  Time: 590.842 ms
  Time: 675.506 ms
  Time: 645.521 ms

A second run gives similar results; this seems quite repeatable for every test I ran, so I'll just show one run of each.

If I then turn off the write cache on the drive:

  $ sudo hdparm -W 0 /dev/sdb

and repeat, these times show up instead:

  Time: 6781.205 ms
  Time: 6805.271 ms
  Time: 6947.037 ms
  Time: 6938.644 ms
  Time: 7346.838 ms

So there's the problem case reproduced, right on regular old ext3 and Ubuntu Jaunty: around 7 seconds to create a database, not real impressive.

Applying the last patch you attached, with the cache on, I see this:

  Time: 396.105 ms
  Time: 389.984 ms
  Time: 469.800 ms
  Time: 386.043 ms
  Time: 441.269 ms

And if I then turn the write cache off, it's back to slow times, but much better:

  Time: 2162.687 ms
  Time: 2174.057 ms
  Time: 2215.785 ms
  Time: 2174.100 ms
  Time: 2190.811 ms

That makes the average times I'm seeing on my server:

             Cached    Uncached
  HEAD       657 ms    6964 ms
  Patched    417 ms    2183 ms

Modest speedup even with a caching drive, and a huge speedup in the case when you have one with slow fsync. Looks to me that if you address Tom's concern about documentation and function naming, committing this patch will certainly deliver as promised on the performance side. Maybe 2 seconds is still too long for some people, but it's at least a whole lot better.

--
Greg Smith    2ndQuadrant    Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.co
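The averages quoted in the message above can be re-derived from the individual timings. The short Python sketch below does so and also computes the HEAD-to-patched speedup ratios, which the message describes only qualitatively. The variable and function names are illustrative, and the numbers are copied from the message.

```python
from statistics import mean

# CREATE DATABASE timings in milliseconds, copied from the message above.
HEAD_CACHED = [713.982, 659.890, 590.842, 675.506, 645.521]
HEAD_UNCACHED = [6781.205, 6805.271, 6947.037, 6938.644, 7346.838]
PATCHED_CACHED = [396.105, 389.984, 469.800, 386.043, 441.269]
PATCHED_UNCACHED = [2162.687, 2174.057, 2215.785, 2174.100, 2190.811]


def speedup(before, after) -> float:
    """Ratio of mean time before the patch to mean time after it."""
    return mean(before) / mean(after)


if __name__ == "__main__":
    for label, before, after in (
        ("cached", HEAD_CACHED, PATCHED_CACHED),
        ("uncached", HEAD_UNCACHED, PATCHED_UNCACHED),
    ):
        print(f"{label}: HEAD {mean(before):.0f} ms, "
              f"patched {mean(after):.0f} ms, "
              f"speedup {speedup(before, after):.2f}x")
```

The means round to the 657/6964/417/2183 ms figures in the message. The ratios come out at roughly 1.6x with the write cache on and roughly 3.2x with it off.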