Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian wrote:
> Greg Smith wrote:
>> Kevin Grittner wrote:
>>> I assume that we send a full 8K to the OS cache, and the file system
>>> writes disk sectors according to its own algorithm. With either
>>> platters or BBU cache, the data is persisted on fsync; why do you see
>>> a risk with one but not the other?
>>
>> I'd like a 10 minute argument please. I started to write something to
>> refute this, only to clarify in my head the sequence of events that
>> leads to the most questionable result, where I feel a bit less certain
>> than I did before of the safety here.
>>
>> Here is the worst case I believe you're describing:
>>
>> 1) Transaction is written to the WAL and sync'd; client receives
>>    COMMIT. Since full_page_writes is off, the data in the WAL consists
>>    only of the delta of what changed on the page.
>> 2) 8K database page is written to the OS cache.
>> 3) PG calls fsync to force the database block out.
>> 4) OS writes the first 4K block of the change to the BBU write cache.
>>    Worst case, this fills the cache, and it takes a moment for some
>>    random writes to process before it has space to buffer again (that
>>    makes this more likely to happen, but it's not required to see the
>>    failure case here).
>> 5) Sudden power interruption; the second half of the page write is lost.
>> 6) Server restarts.
>> 7) That 4K write is now replayed from the battery's cache.
>>
>> At this point, you now have a torn 8K page, with one half old and one
>> half new.
>
> Based on this report, I think we need to update our documentation and
> back-patch removal of the text that says BBU users can safely turn off
> full-page writes. Patch attached.
>
> I think we have fallen into a trap I remember from the late 1990s,
> where I assumed that an 8k-block-based file system would write to the
> disk atomically in 8k segments, which of course it cannot. My bet is
> that even if you write to the kernel in 8k pages, and have an 8k file
> system, the disk is still accessed in 512-byte blocks, even with a BBU.

Doc patch applied.
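The worst case above is easy to simulate. The following standalone Python sketch (an illustration, not PostgreSQL code) shows how losing the second half of an 8K write leaves a page in a mixed state that a WAL delta cannot be safely replayed against, while a full-page image repairs it outright:

```python
PAGE = 8192    # PostgreSQL page size
SECTOR = 4096  # the portion that actually reached stable storage

old_page = b"O" * PAGE  # on-disk image before the write
new_page = b"N" * PAGE  # the image PostgreSQL fsync'd

# Power fails after only the first 4K is persisted:
torn = new_page[:SECTOR] + old_page[SECTOR:]
assert torn != old_page and torn != new_page  # half new, half old

# A WAL delta assumes the page is in a known prior state; replaying it
# against this mixed state can shift or clobber the wrong bytes.  A
# full-page image needs no such assumption -- it overwrites everything:
recovered = new_page  # restored from the full 8K image stored in WAL
assert recovered == new_page
```

The torn page is the reason the delta alone is insufficient: the item offsets the delta refers to may now describe the old half of the page, the new half, or neither.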
-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com
+ It's impossible for everything to be true. +

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk.  By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have a battery-backed disk
!     controller or file-system software that prevents partial page writes
!     (e.g., ZFS), you can turn off this page imaging by turning off the
!     <xref linkend="guc-full-page-writes"> parameter.
     </para>
    </sect1>
--- 164,175 ----
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk.  By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have file-system software
!     that prevents partial page writes (e.g., ZFS), you can turn off
!     this page imaging by turning off the <xref
!     linkend="guc-full-page-writes"> parameter.  Battery-Backed unit
!     (BBU) disk controllers do not prevent partial page writes unless
!     they guarantee that data is written to the BBU as full (8kB) pages.
     </para>
    </sect1>

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] BBU Cache vs. spindles
> Is that true? I have no idea. I thought everything was done at the
> 512-byte block level.

Newer disks (2TB and up) can have 4k sectors, but this still means a page spans several sectors.
Re: [PERFORM] BBU Cache vs. spindles
Pierre C wrote:
>> Is that true? I have no idea. I thought everything was done at the
>> 512-byte block level.
>
> Newer disks (2TB and up) can have 4k sectors, but this still means a
> page spans several sectors.

Yes, I had heard about that.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote:
> Greg Smith <g...@2ndquadrant.com> wrote:
>> I think Kevin's point here may be that if your fsync isn't reliable,
>> you're always in trouble. But if your fsync is good, even torn pages
>> should be repairable by the deltas written to the WAL.
>
> I was actually just arguing that a BBU doesn't eliminate a risk here;
> if there is a risk with production-quality disk drives, there is a risk
> with a controller with a BBU cache. The BBU cache just tends to reduce
> the window of time in which corruption can occur. I wasn't too sure of
> *why* there was a risk, but Tom's post cleared that up.

> I wonder why we need to expose this GUC at all -- perhaps it should be
> off when fsync is off and on otherwise? Leaving it on without fsync is
> just harming performance for not much benefit, and turning it off with
> fsync seems to be saying that you are willing to tolerate a known risk
> of database corruption, just not quite so much as you would have
> without fsync. In reality it seems most likely to be a mistake either
> way.

According to our docs, and my submitted patch, if you are using ZFS then you can turn off full-page writes, so full-page writes are still useful.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith wrote:
> Tom Lane wrote:
>> You've got entirely too simplistic a view of what the delta might be,
>> I fear. In particular there are various sorts of changes that involve
>> inserting the data carried in the WAL record and shifting pre-existing
>> data around to make room, or removing an item and moving the remaining
>> data around. If you try to replay that type of action against a torn
>> page, you'll get corrupted results.
>
> I wasn't sure exactly how those were encoded; thanks for the
> clarification. Given that, it seems to me there are only two situations
> where full_page_writes is safe to turn off:
>
> 1) The operating system block size is exactly the same as the database
> block size, and all writes are guaranteed to be atomic to that block
> size.

Is that true? I have no idea. I thought everything was done at the 512-byte block level.

-- 
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com
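Greg's condition 1) can be partially inspected from userspace. A hedged Python sketch follows (the path is illustrative; note that a matching block size alone proves nothing about write atomicity at the platter level):

```python
import os

PG_PAGE = 8192

# statvfs reports the filesystem's preferred I/O block size; point it at
# the filesystem that holds the data directory (here "/" as a stand-in).
st = os.statvfs("/")
print("filesystem block size:", st.f_bsize)
print("512-byte sectors per 8kB page:", PG_PAGE // 512)   # 16
print("4kB sectors per 8kB page:", PG_PAGE // 4096)       # 2
```

Even with f_bsize == 8192, a page still spans 16 legacy sectors (or 2 on a 4k-sector drive), so the disk can persist part of a page without the rest.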
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 29, 2010 at 11:43 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it. Managing address space mappings is a pain
> in the butt.

I could see this being a *theoretical* benefit in the case that the background writer gains the ability to write out all blocks associated with a file in order. In that case, you might get a win because you could get a single mmap of the entire file, just wholesale memcpy blocks across, then sync/unmap it.

This, of course, assumes a few things that must hold for it to be performant:

0) A list of blocks to be written, grouped by file, is readily available.

1) The pages you write to must be in the page cache, or your memcpy is going to fault them in. With a plain write, you don't need the over-written page in the cache.

2) Now, instead of the torn-page problem being FS block/sector sized, you can actually have a possibly arbitrary amount of the block memory written when the kernel writes out the page. You *really* need full-page writes.

3) The mmap overhead required for the kernel to set up the mappings is less than the repeated syscalls of a simple write().

All those things seem like something that somebody could synthetically benchmark to prove value before even trying to bolt it into PostgreSQL.

a.

-- 
Aidan Van Dyk             Create like a god,
ai...@highrise.ca         command like a king,
http://www.highrise.ca/   work like a slave.
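The scheme in points 0-3 can be prototyped directly. This is a hypothetical Python sketch of "map the file once, memcpy dirty blocks in file order, then msync", not a proposal for the actual background writer:

```python
import mmap
import os
import tempfile

BLOCK = 8192
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\0" * BLOCK * 4)          # pre-sized file of 4 blocks

# Point 0: dirty blocks for this file, grouped and ready.
dirty = {0: b"A" * BLOCK, 2: b"C" * BLOCK}

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
    for blkno in sorted(dirty):          # write in file order
        # Point 1: this "memcpy" faults the target page in if absent.
        m[blkno * BLOCK:(blkno + 1) * BLOCK] = dirty[blkno]
    m.flush()                            # msync: push the dirty pages out

with open(path, "rb") as f:
    img = f.read()
assert img[0:BLOCK] == b"A" * BLOCK
assert img[2 * BLOCK:3 * BLOCK] == b"C" * BLOCK
os.unlink(path)
```

Point 2 is the hidden cost: between the slice assignment and the flush, the kernel may write any partially-updated page out on its own schedule.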
Re: [PERFORM] BBU Cache vs. spindles
Robert Haas <robertmh...@gmail.com> writes:
> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much. What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it. Managing address space mappings is a pain
> in the butt.

In principle that ought to be right about the same speed as using write() to copy the data from shared buffers to kernel disk buffers, anyway.

regards, tom lane
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 29, 2010 at 11:56 AM, Aidan Van Dyk <ai...@highrise.ca> wrote:
> 1) The pages you write to must be in the page cache, or your memcpy is
> going to fault them in. With a plain write, you don't need the
> over-written page in the cache.

I seem to remember a time many years ago when I got bitten by this problem. The fact that our I/O is in 8K pages means this could be a pretty severe hit, I think.

> 2) Now, instead of the torn-page problem being FS block/sector sized,
> you can actually have a possibly arbitrary amount of the block memory
> written when the kernel writes out the page. You *really* need
> full-page writes.

Yeah.

> 3) The mmap overhead required for the kernel to set up the mappings is
> less than the repeated syscalls of a simple write().

You'd expect to save something from that; but on the other hand, at least on 32-bit systems, there's a very limited number of 1GB files that can be simultaneously mapped into one address space, and it's a lot smaller than the number of file descriptors that you can have open. Rumor has it that cutting down the number of fds that can stay open simultaneously is pretty bad for performance, so cutting it down to a number you can count on one hand (maybe one finger) would probably be bad. Maybe on 64-bit it would be OK, but it seems like an awful lot of complexity for at most a minor savings (and a pretty bad anti-savings if point #1 kicks in).

Anyway, this is all totally off-topic...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
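The 32-bit constraint above is simple arithmetic. A back-of-envelope sketch (assuming a typical 3GB user-mode address space; the real figure varies by OS and is further reduced by shared buffers, libraries, and stacks):

```python
user_va = 3 * 2**30   # ~3GB of user virtual address space on 32-bit
segment = 1 * 2**30   # large tables are stored in 1GB segment files

# At most 3 segment files could be mapped concurrently, before counting
# any other consumers of address space -- versus hundreds or thousands
# of simultaneously open file descriptors.
print(user_va // segment)
```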
Re: [PERFORM] BBU Cache vs. spindles
Excerpts from Greg Smith's message of jue oct 21 14:04:17 -0300 2010:
> What I would like to do is beef up the documentation with some concrete
> examples of how to figure out if your cache and associated write path
> are working reliably or not. It should be possible to include "does
> this handle full page writes correctly?" in that test suite. Until we
> have something like that, I'm concerned that bugs in filesystem or
> controller handling may make full_page_writes unsafe even with a BBU,
> and we'd have no way for people to tell if that's true or not.

I think if you assume that there are bugs in the filesystem which you need to protect against, you are already hosed. I imagine there must be some filesystem bug that makes it safe to have full_page_writes=on, but unsafe to have full_page_writes=off; but I'd probably discard those as a rare minority and thus not worth worrying about. I agree it would be worth testing, though.

-- 
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, Robert Haas wrote:
> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> James Mansion <ja...@mansionfamily.plus.com> writes:
>>> Tom Lane wrote:
>>>> The other and probably worse problem is that there's no application
>>>> control over how soon changes to mmap'd pages get to disk. An msync
>>>> will flush them out, but the kernel is free to write dirty pages
>>>> sooner. So if they're depending for consistency on writes not
>>>> happening until msync, it's broken by design. (This is one of the
>>>> big reasons we don't use mmap'd space for Postgres disk buffers.)
>>>
>>> Well, I agree that it sucks for the reason you give - but you use
>>> write and that's *exactly* the same in terms of when it gets written,
>>> as when you update a byte on an mmap'd page.
>>
>> Uh, no, it is not. The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so. If the buffer were mmap'd then we'd have no control over that,
>> which makes it mighty hard to obey the WAL "write log before data"
>> paradigm.
>>
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much. What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it. Managing address space mappings is a pain
> in the butt.

Keep in mind that you have no way of knowing in what order the data in the mmap region gets written out to disk.

David Lang
Re: [PERFORM] BBU Cache vs. spindles
Tom Lane wrote:
> Uh, no, it is not. The difference is that we can update a byte in a
> shared buffer, and know that it *isn't* getting written out before we
> say so.

Well, I don't know where you got the idea I was referring to that sort of thing - it's the same as writing to a buffer before copying to the mmap'd area.

> It's true that we don't know whether write() causes an immediate or
> delayed disk write, but we generally don't care that much.

Which is what I was referring to.

> What we do care about is being able to ensure that a WAL write happens
> before the data write, and with mmap we don't have control over that.

I think you have just the same control either way, because you can only force ordering with an appropriate explicit sync, and in the absence of such a sync all bets are off for whether/when each disk page is written out. And if you can't ensure that the controller and disk are write-through, you'd better do a hardware cache flush too, right? A shame that so many systems have relatively poor handling of that hardware flush.
Re: [PERFORM] BBU Cache vs. spindles
On Fri, 29 Oct 2010, James Mansion wrote:
> Tom Lane wrote:
>> Uh, no, it is not. The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so.
>
> Well, I don't know where you got the idea I was referring to that sort
> of thing - it's the same as writing to a buffer before copying to the
> mmap'd area.
>
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.
>
> Which is what I was referring to.
>
>> What we do care about is being able to ensure that a WAL write happens
>> before the data write, and with mmap we don't have control over that.
>
> I think you have just the same control either way, because you can only
> force ordering with an appropriate explicit sync, and in the absence of
> such a sync all bets are off for whether/when each disk page is written
> out. And if you can't ensure that the controller and disk are
> write-through, you'd better do a hardware cache flush too, right? A
> shame that so many systems have relatively poor handling of that
> hardware flush.

The issue is that when you update a mmapped chunk of data, it could be written out immediately without you doing _anything_ (and thanks to multiple cores/threads, it could get written out while you are still in the middle of updating it). When you update an internal buffer and then write that, you know that nothing will hit the disk before you issue the write command.

David Lang
Re: [PERFORM] BBU Cache vs. spindles
Tom Lane wrote:
> The other and probably worse problem is that there's no application
> control over how soon changes to mmap'd pages get to disk. An msync
> will flush them out, but the kernel is free to write dirty pages
> sooner. So if they're depending for consistency on writes not
> happening until msync, it's broken by design. (This is one of the big
> reasons we don't use mmap'd space for Postgres disk buffers.)

Well, I agree that it sucks for the reason you give - but you use write and that's *exactly* the same in terms of when it gets written, as when you update a byte on an mmap'd page. And you're quite happy to use write. The only difference is that it's a lot more explicit where the point of 'maybe it's written and maybe it isn't' occurs.

There need be no real difference in the architecture for one over the other: there does seem to be evidence that write and read can have better forward-read and write-behind behaviour, because read/write does allow you to initiate an IO with a hint to a size that exceeds a hardware page.

And yes, after getting into the details while starting to port TC to Windows, I decided to bin it. Especially handy that SQLite3 has WAL now. (And one last dig - TC didn't even have a checksum that would let you tell when it had been broken; but it might all be fixed now of course, I don't have time to check.)

James
Re: [PERFORM] BBU Cache vs. spindles
James Mansion <ja...@mansionfamily.plus.com> writes:
> Tom Lane wrote:
>> The other and probably worse problem is that there's no application
>> control over how soon changes to mmap'd pages get to disk. An msync
>> will flush them out, but the kernel is free to write dirty pages
>> sooner. So if they're depending for consistency on writes not
>> happening until msync, it's broken by design. (This is one of the big
>> reasons we don't use mmap'd space for Postgres disk buffers.)
>
> Well, I agree that it sucks for the reason you give - but you use write
> and that's *exactly* the same in terms of when it gets written, as when
> you update a byte on an mmap'd page.

Uh, no, it is not. The difference is that we can update a byte in a shared buffer, and know that it *isn't* getting written out before we say so. If the buffer were mmap'd then we'd have no control over that, which makes it mighty hard to obey the WAL "write log before data" paradigm.

It's true that we don't know whether write() causes an immediate or delayed disk write, but we generally don't care that much. What we do care about is being able to ensure that a WAL write happens before the data write, and with mmap we don't have control over that.

regards, tom lane
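The ordering discipline Tom describes can be made concrete. With explicit write()+fsync(), the WAL record is known durable before the data page is touched; a hypothetical Python sketch of the rule (file names and payloads are illustrative only):

```python
import os
import tempfile

def durable_write(path, payload):
    """write() followed by fsync(): payload is on stable storage on return."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)
    finally:
        os.close(fd)

d = tempfile.mkdtemp()
wal_path = os.path.join(d, "wal")
data_path = os.path.join(d, "page7")

durable_write(wal_path, b"WAL record: delta for page 7")  # 1: WAL first...
durable_write(data_path, b"page 7, new image")            # 2: ...then data

# With mmap'd buffers there is no step-1/step-2 boundary: the kernel may
# flush the dirty data page at any moment, before the WAL record is durable.
```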
Re: [PERFORM] BBU Cache vs. spindles
On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wult...@gmail.com> wrote:
> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wult...@gmail.com> wrote:
>>> The double write buffer is one of the few areas where InnoDB does
>>> more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
>>> checkpoints (which help to keep dirty pages in memory longer),
>>> buffering of writing out changes to secondary indexes, and recently
>>> tunable page level compression.
>>
>> Baron Schwartz was talking to me about this at Surge. I don't really
>> understand how the fuzzy checkpoint stuff works, and I haven't been
>> able to find a good description of it anywhere. How does it keep
>> dirty pages in memory longer? Details on the other things you mention
>> would be interesting to hear, too.
>
> For checkpoint behavior:
> http://books.google.com/books?id=S_yHERPRZScCpg=PA606lpg=PA606dq=fuzzy+checkpointsource=blots=JJrzRUKBGhsig=UOMPsRy5E-YDgjAFkaSVn3dps_Mhl=enei=_k8yTOfeHYzZnAepyumLBAsa=Xoi=book_resultct=resultresnum=8ved=0CEYQ6AEwBw#v=onepageq=fuzzy%20checkpointf=false
>
> I would think that best case behavior sharp checkpoints with a large
> checkpoint_completion_target would have behavior similar to a fuzzy
> checkpoint.

Well, under that definition of a fuzzy checkpoint, our checkpoints are fuzzy even with checkpoint_completion_target=0.

What Baron seemed to be describing was a scheme whereby you could do what I might call partial checkpoints. IOW, you want to move the redo pointer without writing out ALL the dirty buffers in memory, so you write out the pages with the oldest LSNs and then move the redo pointer to the oldest LSN you have left. Except that doesn't quite work, because the page might have been dirtied at LSN X and then later updated again at LSN Y, and you still have to flush it to disk before moving the redo pointer to any value > X. So you work around that by maintaining a first-dirtied LSN for each page as well as the current LSN.

I'm not 100% sure that this is how it works or that it would work in PG, but even assuming that it is and does, I'm not sure what the benefit is over the checkpoint-spreading logic we have now. There might be some benefit in sorting the writes that we do, so that we can spread out the fsyncs. So, write all the blocks to a given file, fsync, and then repeat for each underlying data file that has at least one dirty block. But that's completely orthogonal to (and would actually be hindered by) the approach described in the preceding paragraph.

> Insert (for InnoDB 1.1+ evidently there is also delete and purge)
> buffering:
> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

We do something a bit like this for GiST indexes. It would be interesting to see if it also has a benefit for btree indexes.

> For a recent ~800GB db I had to restore, the insert buffer saved 92% of
> io needed for secondary indexes.
>
> Compression:
> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>
> For many workloads 50% compression results in negligible impact to
> performance. For certain workloads compression can help performance.
> Please note that InnoDB also has a non-tunable TOAST-like feature.

Interesting. I am surprised this works well. It seems that this only works for pages that can be compressed by >=50%, which seems like it could result in a lot of CPU wasted on failed attempts to compress.

>> Given that InnoDB is not shipping its logs across the wire, I don't
>> think many users would really care if it used the double-writer or
>> full-page-writes approach to the redo log (other than the fact that
>> the log files would be bigger). PG on the other hand *is* pushing its
>> logs over the wire... So how is InnoDB doing replication? Is there a
>> second log just for that?
>
> The other log is the binary log and it is one of the biggest problems
> with MySQL. Running MySQL in such a way that the binary log stays in
> sync with the InnoDB redo log has a very significant impact on
> performance.
> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
> (check out the pretty graph)

Hmm. That seems kinda painful. Having to ship full page images over the wire doesn't seem so bad by comparison, though I'm not very happy about having to do that either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
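The "first dirtied LSN" bookkeeping Robert describes for partial checkpoints can be sketched as a toy model (hypothetical Python with made-up LSNs, not InnoDB's or PostgreSQL's actual data structures):

```python
# page id -> (first_dirtied_lsn, last_update_lsn)
dirty = {
    "A": (100, 250),
    "B": (120, 130),
    "C": (300, 310),
}

def flush_one(dirty):
    """Flush the page with the oldest first-dirtied LSN, then return how
    far the redo pointer may now advance (the minimum first-dirtied LSN
    among the pages still unflushed)."""
    oldest = min(dirty, key=lambda p: dirty[p][0])
    del dirty[oldest]  # pretend the page was written out and fsync'd
    return min((first for first, _ in dirty.values()), default=None)

redo = flush_one(dirty)  # flushes A (first dirtied at 100)
assert redo == 120       # limited by B's first-dirtied LSN, not A's last update
```

The key point is that the last-update LSN alone is not enough: page A was updated at LSN 250, but it had to be flushed before the redo pointer could pass 100, its first-dirtied LSN.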
Re: [PERFORM] BBU Cache vs. spindles
On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch wult...@gmail.com wrote: On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote: The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsynch's) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Baron Schwartz was talking to me about this at Surge. I don't really understand how the fuzzy checkpoint stuff works, and I haven't been able to find a good description of it anywhere. How does it keep dirty pages in memory longer? Details on the other things you mention would be interesting to hear, too. For checkpoint behavior: http://books.google.com/books?id=S_yHERPRZScCpg=PA606lpg=PA606dq=fuzzy+checkpointsource=blots=JJrzRUKBGhsig=UOMPsRy5E-YDgjAFkaSVn3dps_Mhl=enei=_k8yTOfeHYzZnAepyumLBAsa=Xoi=book_resultct=resultresnum=8ved=0CEYQ6AEwBw#v=onepageq=fuzzy%20checkpointf=false I would think that best case behavior sharp checkpoints with a large checkpoint_completion_target would have behavior similar to a fuzzy checkpoint. Well, under that definition of a fuzzy checkpoint, our checkpoints are fuzzy even with checkpoint_completion_target=0. What Baron seemed to be describing was a scheme whereby you could do what I might call partial checkpoints. IOW, you want to move the redo pointer without writing out ALL the dirty buffers in memory, so you write out the pages with the oldest LSNs and then move the redo pointer to the oldest LSN you have left. Except that doesn't quite work, because the page might have been dirtied at LSN X and then later updated again at LSN Y, and you still have to flush it to disk before moving the redo pointer to any value X. 
So you work around that by maintaining a first dirtied LSN for each page as well as the current LSN. I'm not 100% sure that this is how it works or that it would work in PG, but even assuming that it is and does, I'm not sure what the benefit is over the checkpoint-spreading logic we have now. There might be some benefit in sorting the writes that we do, so that we can spread out the fsyncs. So, write all the blocks to a give file, fsync, and then repeat for each underlying data file that has at least one dirty block. But that's completely orthogonal to (and would actually be hindered by) the approach described in the preceding paragraph. I wish I could answer your questions better. I am a power user that does not fully understand InnoDB internals. There are not all that many folks that have a very good understanding of InnoDB internals (given how well it works there is not all that much need). Insert (for innodb 1.1+ evidently there is also does delete and purge) buffering: http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html We do something a bit like this for GIST indices. It would be interesting to see if it also has a benefit for btree indices. For a recent ~800GB db I had to restore, the insert buffer saved 92% of io needed for secondary indexes. Compression: http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html For many workloads 50% compression results in negligible impact to performance. For certain workloads compression can help performance. Please note that InnoDB also has non-tunable toast like feature. Interesting. I am surprised this works well. It seems that this only works for pages that can be compressed by =50%, which seems like it could result in a lot of CPU wasted on failed attempts to compress. In my world, the spinning disk is almost always the bottleneck. Trading CPU for IO is almost always a good deal for me. 
Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). PG on the other hand *is* pushing its logs over the wire... So how is InnoDB doing replication? Is there a second log just for that? The other log is the binary log and it is one of the biggest problems with MySQL. Running MySQL in such a way that the binary log stays in sync with the InnoDB redo has a very significant impact on performance. http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/ http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html (check out the pretty graph) Hmm. That seems kinda painful. Having to ship full page images over the wire doesn't seem so bad by comparison, though I'm not very happy about having to do that either. The binary log is less than ideal, but with MySQL replication I can replicate
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Rob Wultsch wult...@gmail.com wrote: I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. For a reasonably brief description of InnoDB double write buffers, I found this: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ One big question before even considering this would be how to determine whether a potentially torn page is inconsistent. Without a page CRC or some such mechanism, I don't see how this technique is possible. There are two sides to this problem: figuring out when to write a page to the double write buffer, and figuring out when to read it back from the double write buffer. The first seems easy: we just do it whenever we would XLOG a full page image. As to the second, when we write the page out to the double write buffer, we could also write to the double write buffer the LSN of the WAL record which depends on that full page image. Then, at the start of recovery, we scan the double write buffer and remember all those LSNs. When we reach one of them, we replay the full page image. The good thing about this is that it would reduce WAL volume; the bad thing about it is that it would probably mean doing two fsyncs where we only now do one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
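[Editor's note: the recovery-side scheme Robert describes can be sketched as follows. This is illustrative Python, not PostgreSQL internals; the double-write-buffer layout and all names are assumptions.]

```python
# Sketch: pages in the double write buffer are tagged with the LSN of the
# WAL record that would otherwise have carried a full page image.  At the
# start of recovery we scan the buffer and remember those LSNs; when redo
# reaches one of them, we restore the saved full image before applying the
# delta, repairing any torn write of that page.

def scan_double_write_buffer(dwb_entries):
    """dwb_entries: list of (lsn, page_id, full_page_image)."""
    return {lsn: (page_id, image) for lsn, page_id, image in dwb_entries}

def run_redo(wal_records, dwb_entries, pages):
    """wal_records: ordered (lsn, page_id, delta); pages: page_id -> contents,
    possibly torn on disk.  Deltas are modeled as simple string appends."""
    saved = scan_double_write_buffer(dwb_entries)
    for lsn, page_id, delta in wal_records:
        if lsn in saved:
            saved_id, image = saved[lsn]
            pages[saved_id] = image                      # replay the full page image
        pages[page_id] = pages.get(page_id, "") + delta  # then apply the delta
    return pages
```

The cost Robert notes, two fsyncs where there is now one, is on the write side and does not appear in this recovery-only sketch.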
Re: [PERFORM] BBU Cache vs. spindles
On Tue, Oct 26, 2010 at 5:41 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Rob Wultsch wult...@gmail.com wrote: I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. For a reasonably brief description of InnoDB double write buffers, I found this: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ One big question before even considering this would be how to determine whether a potentially torn page is inconsistent. Without a page CRC or some such mechanism, I don't see how this technique is possible. There are two sides to this problem: figuring out when to write a page to the double write buffer, and figuring out when to read it back from the double write buffer. The first seems easy: we just do it whenever we would XLOG a full page image. As to the second, when we write the page out to the double write buffer, we could also write to the double write buffer the LSN of the WAL record which depends on that full page image. Then, at the start of recovery, we scan the double write buffer and remember all those LSNs. When we reach one of them, we replay the full page image. The good thing about this is that it would reduce WAL volume; the bad thing about it is that it would probably mean doing two fsyncs where we only now do one. The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). 
PG on the other hand *is* pushing its logs over the wire... -- Rob Wultsch wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote: The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Baron Schwartz was talking to me about this at Surge. I don't really understand how the fuzzy checkpoint stuff works, and I haven't been able to find a good description of it anywhere. How does it keep dirty pages in memory longer? Details on the other things you mention would be interesting to hear, too. Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). PG on the other hand *is* pushing its logs over the wire... So how is InnoDB doing replication? Is there a second log just for that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PERFORM] BBU Cache vs. spindles
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas robertmh...@gmail.com wrote: On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote: The double write buffer is one of the few areas where InnoDB does more IO (in the form of fsyncs) than PG. InnoDB also has fuzzy checkpoints (which help to keep dirty pages in memory longer), buffering of writing out changes to secondary indexes, and recently tunable page level compression. Baron Schwartz was talking to me about this at Surge. I don't really understand how the fuzzy checkpoint stuff works, and I haven't been able to find a good description of it anywhere. How does it keep dirty pages in memory longer? Details on the other things you mention would be interesting to hear, too. For checkpoint behavior: http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false I would think that, in the best case, sharp checkpoints with a large checkpoint_completion_target would have behavior similar to a fuzzy checkpoint. Insert buffering (in InnoDB 1.1+ it evidently also covers delete and purge): http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html For a recent ~800GB db I had to restore, the insert buffer saved 92% of the I/O needed for secondary indexes. Compression: http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html For many workloads 50% compression results in negligible impact to performance. For certain workloads compression can help performance. Please note that InnoDB also has a non-tunable TOAST-like feature. Given that InnoDB is not shipping its logs across the wire, I don't think many users would really care if it used the double writer or full page writes approach to the redo log (other than the fact that the log files would be bigger). 
PG on the other hand *is* pushing its logs over the wire... So how is InnoDB doing replication? Is there a second log just for that? The other log is the binary log and it is one of the biggest problems with MySQL. Running MySQL in such a way that the binary log stays in sync with the InnoDB redo has a very significant impact on performance. http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/ http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html (check out the pretty graph) If you are going to West you should consider heading over to the Facebook office on Tuesday as the MySQL team is having something of an open house: http://www.facebook.com/event.php?eid=160712450628622 Mark Callaghan from the Facebook MySQL engineering team (and several members of their ops team, for that matter) understands InnoDB dramatically better than I do. -- Rob Wultsch wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? Surely 'the data is persisted sometime after our write and before the fsync returns, but may be written in small chunks, out of order, and in an unpredictable way'? When I looked at the internals of TokyoCabinet for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically.
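[Editor's note: the "small chunks, out of order" point is easy to model. An 8kB page reaching the platter as 512-byte sectors in arbitrary order means a power cut can persist any subset of new sectors over old ones. A toy simulation, with illustrative parameters:]

```python
# Simulate a power cut in the middle of an 8kB page write: the drive/OS may
# persist the 512-byte sectors in any order, so after a crash each sector is
# either entirely old or entirely new -- but the page as a whole can be any
# mix of the two (a torn page).

import random

SECTOR = 512
PAGE = 8192

def crash_during_write(old_page, new_page, seed):
    """Overwrite old_page with new_page sector by sector in a random order,
    losing power after a random number of sectors have reached the platter."""
    assert len(old_page) == len(new_page) == PAGE
    rng = random.Random(seed)
    order = list(range(PAGE // SECTOR))
    rng.shuffle(order)                              # writes may be reordered
    survived = order[:rng.randrange(PAGE // SECTOR + 1)]
    disk = bytearray(old_page)
    for s in survived:                              # only these sectors persisted
        disk[s * SECTOR:(s + 1) * SECTOR] = new_page[s * SECTOR:(s + 1) * SECTOR]
    return bytes(disk)

old = bytes([0x00]) * PAGE
new = bytes([0xFF]) * PAGE
after = crash_during_write(old, new, seed=7)
# 'after' may or may not be torn, depending on how many sectors survived.
torn = after != old and after != new
```

Every surviving state is sector-aligned, but the page can end up as any mixture of old and new sectors; that is precisely the torn write full_page_writes guards against.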
Re: [PERFORM] BBU Cache vs. spindles
James Mansion wrote: When I looked at the internals of TokyoCabinet for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically. If TokyoCabinet presumes that's true and overwrites existing blocks with that assumption, it would land onto my list of databases I wouldn't trust to hold my TODO list. Flip off power to a server, and you have no idea what portion of the blocks sitting in the drive's cache actually made it to disk; that's not even guaranteed atomic to the byte level. Torn pages happen all the time unless you either a) put the entire write into a non-volatile cache before writing any of it, b) write and sync somewhere else first and then do a journaled filesystem pointer swap from the old page to the new one, or c) journal the whole write the way PostgreSQL does with full_page_writes and the WAL. The discussion here veered off over whether (a) was sufficiently satisfied just by having a RAID controller with battery backup, and what I concluded from the dive into the details is that it's definitely not true unless the filesystem block size exactly matches the database one. And even then, make sure you test heavily. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [PERFORM] BBU Cache vs. spindles
Jesper Krogh wrote: Can you point to some ZFS docs that tell that this is the case.. I'd be surprised if it doesn't copy away the old block and replace it with the new one in-place. The other behaviour would quite quickly lead to a hugely fragmented filesystem that performs next to useless, and ZFS doesn't seem to be in that category. http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. http://opensolaris.org/jive/thread.jspa?messageID=19264 discusses how this interacts with the common types of hardware around: no guarantees with lying hard drives as always, but otherwise you're fine. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com writes: James Mansion wrote: When I looked at the internals of TokyoCabinet for example, the design was flawed but would be 'fairly robust' so long as mmap'd pages that were dirtied did not get persisted until msync, and were then persisted atomically. If TokyoCabinet presumes that's true and overwrites existing blocks with that assumption, it would land onto my list of databases I wouldn't trust to hold my TODO list. Flip off power to a server, and you have no idea what portion of the blocks sitting in the drive's cache actually made it to disk; that's not even guaranteed atomic to the byte level. The other and probably worse problem is that there's no application control over how soon changes to mmap'd pages get to disk. An msync will flush them out, but the kernel is free to write dirty pages sooner. So if they're depending for consistency on writes not happening until msync, it's broken by design. (This is one of the big reasons we don't use mmap'd space for Postgres disk buffers.) regards, tom lane
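[Editor's note: Tom's point is that write()+fsync() keeps ordering in the application's hands, while mmap does not. A minimal sketch of the WAL-before-data ordering that this control makes possible; file names and contents are illustrative.]

```python
# With explicit write()+fsync(), the application can guarantee the WAL record
# is on stable storage BEFORE the data page write begins.  With mmap'd
# buffers the kernel may flush a dirtied page at any moment, so no such
# ordering barrier exists short of msync -- and by then it can be too late.

import os
import tempfile

def durable_write(path, data):
    """Write data and force it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # nothing after this runs until the data is durable

tmp = tempfile.mkdtemp()
wal_path = os.path.join(tmp, "wal")
heap_path = os.path.join(tmp, "heap")

durable_write(wal_path, b"delta for page 7")    # WAL record first...
durable_write(heap_path, b"page 7 contents")    # ...data page only after WAL is safe
```

There is no mmap equivalent of that barrier: a dirty mapped page can reach disk before its WAL record, which is why consistency cannot hinge on "nothing is persisted until msync".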
Re: [PERFORM] BBU Cache vs. spindles
Rob Wultsch wrote: I really would like to work with PG more and this seems like [full_page_writes] would be a significant hindrance for certain usage patterns. Lots of replication does not take place over gig... Certainly most of the Wisconsin State Courts replication takes place over WAN connections at a few Mbps. I haven't seen any evidence that having full_page_writes on has caused us problems, personally. In the PostgreSQL community you generally need to show some hard numbers from a repeatable test case for the community to believe that there's a problem which needs fixing, much less to buy in to some particular fix for the purported problem. On the other hand, if you can show that there actually *is* a problem, I've never seen a group which responds so quickly and effectively to solve it as the PostgreSQL community. Don't get too attached to a particular solution without proof that it's better than the alternatives, though. -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Even if it's possible, it's far from clear to me that it would be an improvement. The author estimates (apparently somewhat loosely) that it's a 5% to 10% performance hit in InnoDB; I'm far from certain that full_page_writes cost us that much. Does anyone have benchmark numbers handy? It most certainly can, depending on your CPU saturation and I/O support. I've seen a 10% improvement in throughput from turning off full_page_writes on some machines, such as when we were doing the SpecJAppserver benchmarks on Solaris. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com wrote: I think Kevin's point here may be that if your fsync isn't reliable, you're always in trouble. But if your fsync is good, even torn pages should be repairable by the deltas written to the WAL I was actually just arguing that a BBU doesn't eliminate a risk here; if there is a risk with production-quality disk drives, there is a risk with a controller with a BBU cache. The BBU cache just tends to reduce the window of time in which corruption can occur. I wasn't too sure of *why* there was a risk, but Tom's post cleared that up. I wonder why we need to expose this GUC at all -- perhaps it should be off when fsync is off and on otherwise? Leaving it on without fsync is just harming performance for not much benefit, and turning it off with fsync seems to be saying that you are willing to tolerate a known risk of database corruption, just not quite so much as you have without fsync. In reality it seems most likely to be a mistake, either way. -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Tom Lane wrote: You've got entirely too simplistic a view of what the delta might be, I fear. In particular there are various sorts of changes that involve inserting the data carried in the WAL record and shifting pre-existing data around to make room, or removing an item and moving remaining data around. If you try to replay that type of action against a torn page, you'll get corrupted results. I wasn't sure exactly how those were encoded, thanks for the clarification. Given that, it seems to me there are only two situations where full_page_writes is safe to turn off: 1) The operating system block size is exactly the same as the database block size, and all writes are guaranteed to be atomic to that block size. 2) You're using a form of journaled filesystem where data blocks are never updated, they're always written elsewhere and the filesystem is redirected to that new block once it's on disk. Looks to me like whether or not there's a non-volatile write cache sitting in the middle, like a BBU protected RAID card, doesn't really make any difference here then. I think that most people who have thought they were safe to turn off full_page_writes in the past did so because they believed they were in category (1) here. I've never advised anyone to do that, because it's so difficult to validate the truth of. Just given that, I'd be tempted to join in on suggesting this parameter just go away in the name of safety, except that I think category (2) here is growing now. ZFS is the most obvious example where the atomic write implementation seems to always make disabling full_page_writes safe. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [PERFORM] BBU Cache vs. spindles
On 2010-10-22 17:37, Greg Smith wrote: I think that most people who have thought they were safe to turn off full_page_writes in the past did so because they believed they were in category (1) here. I've never advised anyone to do that, because it's so difficult to validate the truth of. Just given that, I'd be tempted to join in on suggesting this parameter just go away in the name of safety, except that I think category (2) here is growing now. ZFS is the most obvious example where the atomic write implementation seems to always make disabling full_page_writes safe. Can you point to some ZFS docs that tell that this is the case.. I'd be surprised if it doesn't copy away the old block and replace it with the new one in-place. The other behaviour would quite quickly lead to a hugely fragmented filesystem that performs next to useless, and ZFS doesn't seem to be in that category.. ... All given my total lack of insight into ZFS. -- Jesper
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 8:37 AM, Greg Smith g...@2ndquadrant.com wrote: Tom Lane wrote: You've got entirely too simplistic a view of what the delta might be, I fear. In particular there are various sorts of changes that involve inserting the data carried in the WAL record and shifting pre-existing data around to make room, or removing an item and moving remaining data around. If you try to replay that type of action against a torn page, you'll get corrupted results. I wasn't sure exactly how those were encoded, thanks for the clarification. Given that, it seems to me there are only two situations where full_page_writes is safe to turn off: 1) The operating system block size is exactly the same as the database block size, and all writes are guaranteed to be atomic to that block size. 2) You're using a form of journaled filesystem where data blocks are never updated, they're always written elsewhere and the filesystem is redirected to that new block once it's on disk. Looks to me like whether or not there's a non-volatile write cache sitting in the middle, like a BBU protected RAID card, doesn't really make any difference here then. I think that most people who have thought they were safe to turn off full_page_writes in the past did so because they believed they were in category (1) here. I've never advised anyone to do that, because it's so difficult to validate the truth of. Just given that, I'd be tempted to join in on suggesting this parameter just go away in the name of safety, except that I think category (2) here is growing now. ZFS is the most obvious example where the atomic write implementation seems to always make disabling full_page_writes safe. For the sake of argument, has PG considered using a double write buffer similar to InnoDB? -- Rob Wultsch wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
Rob Wultsch wult...@gmail.com wrote: has PG considered using a double write buffer similar to InnoDB? That seems inferior to the full_page_writes strategy, where you only write a page twice the first time it is written after a checkpoint. We're talking about when we might be able to write *less*, not more. -Kevin
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 10:28 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Rob Wultsch wult...@gmail.com wrote: has PG considered using a double write buffer similar to InnoDB? That seems inferior to the full_page_writes strategy, where you only write a page twice the first time it is written after a checkpoint. We're talking about when we might be able to write *less*, not more. -Kevin By write do you mean number of writes, or the number of bytes of the writes? For number of writes, yes a double write buffer will lose. In terms of number of bytes, I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. -- Rob Wultsch wult...@gmail.com
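[Editor's note: the writes-versus-bytes distinction can be put in rough numbers. This is a back-of-envelope model with invented parameters, not a benchmark: with full_page_writes=on, the first change to a page after a checkpoint puts a full 8kB image into the WAL, which is also what crosses the network to slaves; with a double write buffer, the WAL carries only deltas, but each flushed page is written twice locally.]

```python
# Toy accounting of bytes written under the two schemes.  All parameters
# (pages touched per checkpoint, writes per page, delta size) are invented
# for illustration.

PAGE = 8192

def bytes_per_scheme(pages_touched, writes_per_page, delta_bytes):
    # full_page_writes=on: one full image + (writes-1) deltas per page in the WAL
    fpw_wal = pages_touched * (PAGE + (writes_per_page - 1) * delta_bytes)
    # double write buffer: WAL is all deltas; each page image is written
    # twice locally (buffer + final location), but never shipped over the wire
    dwb_wal = pages_touched * writes_per_page * delta_bytes
    dwb_local_extra = pages_touched * 2 * PAGE
    return fpw_wal, dwb_wal, dwb_local_extra

fpw_wal, dwb_wal, extra = bytes_per_scheme(1000, 5, 100)
# WAL (and hence network) volume shrinks sharply; local write volume grows.
```

Under these assumptions the WAL shrinks from ~8.6MB to ~0.5MB per checkpoint cycle while local writes gain ~16MB, which is Rob's point: the trade favors the double write buffer exactly when the WAL has to cross a slow link.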
Re: [PERFORM] BBU Cache vs. spindles
Rob Wultsch wult...@gmail.com wrote: I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. For a reasonably brief description of InnoDB double write buffers, I found this: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ One big question before even considering this would be how to determine whether a potentially torn page is inconsistent. Without a page CRC or some such mechanism, I don't see how this technique is possible. Even if it's possible, it's far from clear to me that it would be an improvement. The author estimates (apparently somewhat loosely) that it's a 5% to 10% performance hit in InnoDB; I'm far from certain that full_page_writes cost us that much. Does anyone have benchmark numbers handy? -Kevin
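[Editor's note: the page-CRC mechanism Kevin says would be required could look roughly like this. The 4-byte header layout is invented for illustration; PostgreSQL did not have page checksums at the time of this thread.]

```python
# Sketch: if each on-disk page carries a checksum over its contents,
# recovery can distinguish a torn page from a consistent one, and only a
# page that fails the check needs to be restored from the double write
# buffer copy.

import struct
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix the payload with a CRC32 of its contents."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def is_consistent(page: bytes) -> bool:
    (crc,) = struct.unpack(">I", page[:4])
    return crc == zlib.crc32(page[4:])

good = seal(b"x" * 8188)                            # a full, consistent 8kB page
torn = good[:4096] + b"\x00" * (len(good) - 4096)   # back half lost in a crash
```

Without some check of this kind, recovery cannot tell which on-disk copy (the page or its double-write image) to trust, which is exactly Kevin's objection.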
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Rob Wultsch wult...@gmail.com wrote: I would think full_page_writes=off + double write buffer should be far superior, particularly given that the WAL is shipped over the network to slaves. For a reasonably brief description of InnoDB double write buffers, I found this: http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/ One big question before even considering this would be how to determine whether a potentially torn page is inconsistent. Without a page CRC or some such mechanism, I don't see how this technique is possible. Even if it's possible, it's far from clear to me that it would be an improvement. The author estimates (apparently somewhat loosely) that it's a 5% to 10% performance hit in InnoDB; I'm far from certain that full_page_writes cost us that much. Does anyone have benchmark numbers handy? -Kevin Ignoring (briefly) the cost in terms of performance of the different systems, not needing full_page_writes would make geographically dispersed replication possible for certain cases where it is not currently (or at least rather painful). -- Rob Wultsch wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
Rob Wultsch wult...@gmail.com wrote: not needing full_page_writes would make geographically dispersed replication possible for certain cases where it is not currently (or at least rather painful). Do you have any hard numbers on WAL file size impact? How much does pglesslog help in a file-based WAL transmission environment? Should we be considering similar filtering for streaming replication? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
On Fri, Oct 22, 2010 at 1:15 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Rob Wultsch wult...@gmail.com wrote: not needing full_page_writes would make geographically dispersed replication possible for certain cases where it is not currently (or at least rather painful). Do you have any hard numbers on WAL file size impact? How much does pglesslog help in a file-based WAL transmission environment? Should we be considering similar filtering for streaming replication? -Kevin No, I am a DBA that mostly works on MySQL. I have had to deal with (handwaving...) tangential issues recently. I really would like to work with PG more and this seems like it would be a significant hindrance for certain usage patterns. Lots of replication does not take place over gig... -- Rob Wultsch wult...@gmail.com
Re: [PERFORM] BBU Cache vs. spindles
On 10/20/2010 09:45 PM, Scott Marlowe wrote: On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote: On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. Raid card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have. Totally agree! BBU first, more spindles second. Agreed. note that while you can get incredible burst performance from a battery backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the bbu cache is now only re-ordering really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of your max tps you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long running transactional test (pgbench is a simple one to start with) that floods the io subsystem so you see what your average throughput is with the WAL and data store getting flooded. 
I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour. Thanks for all the replies. This is what I suspected but since I can't just buy one of everything to try, I wanted a sanity-check before spending the $$$. I am not too worried about saturating the controller cache as the current much lower spec machine can handle the sustained load just fine and the bursts are typically only 1-3 seconds long spaced a minute or more apart. Cheers, Steve
Re: [PERFORM] BBU Cache vs. spindles
Scott Marlowe wrote: On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote: On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. Raid card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have. Totally agree! BBU first, more spindles second. Agreed. note that while you can get incredible burst performance from a battery backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the bbu cache is now only re-ordering really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of your max tps you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long running transactional test (pgbench is a simple one to start with) that floods the io subsystem so you see what your average throughput is with the WAL and data store getting flooded. 
I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour. With a BBU you can turn off full_page_writes, which should decrease the WAL traffic. However, I don't see this mentioned in our documentation. Should I add it? -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: With a BBU you can turn off full_page_writes My understanding is that that is not without risk. What happens if the WAL is written, there is a commit, but the data page has not yet been written to the controller? Don't we still have a torn page? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: With a BBU you can turn off full_page_writes My understanding is that that is not without risk. What happens if the WAL is written, there is a commit, but the data page has not yet been written to the controller? Don't we still have a torn page? I don't see how full_page_writes affects pages not yet written to the controller. full_page_writes is designed to guard against a partial write to a device. I don't think the RAID cache can be partially written to, and the cache will not be cleared until the drive has fully written the data to disk. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian wrote: With a BBU you can turn off full_page_writes, which should decrease the WAL traffic. However, I don't see this mentioned in our documentation. Should I add it? What I would like to do is beef up the documentation with some concrete examples of how to figure out if your cache and associated write path are working reliably or not. It should be possible to include "does this handle full page writes correctly?" in that test suite. Until we have something like that, I'm concerned that bugs in filesystem or controller handling may make full_page_writes unsafe even with a BBU, and we'd have no way for people to tell whether that's true or not. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: full_page_writes is designed to guard against a partial write to a device. I don't think the RAID cache can be partially written to So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: full_page_writes is designed to guard against a partial write to a device. I don't think the RAID cache can be partially written to So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. That's why full_page_writes should be safe on a system with a BBU, as Bruce is suggesting. But I'd like to see some independent proof of that fact, one that includes some targeted tests users can run, before we start recommending that practice. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us
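Greg's reasoning rests on the fsync contract: the kernel may split an 8K write into several smaller I/Os, but fsync must not return success until every byte previously written has reached stable storage (a battery-backed controller cache counts as stable). A minimal Python sketch of that write-then-fsync pattern follows; the file name and helper are illustrative, not PostgreSQL's actual code:

```python
import os
import tempfile

PAGE_SIZE = 8192

def write_page_durably(path: str, page: bytes) -> None:
    """Write one 8K page and force it to stable storage.

    The kernel is free to split the write into smaller I/Os, but
    fsync() may not return success until every byte is durable --
    on platters, or in a battery-backed controller cache.
    """
    assert len(page) == PAGE_SIZE
    with open(path, "wb") as f:
        f.write(page)          # may hit the device as many small writes
        f.flush()              # push Python's buffer into the OS cache
        os.fsync(f.fileno())   # durability barrier: all 8K, or an error

path = os.path.join(tempfile.mkdtemp(), "page.dat")
write_page_durably(path, b"\x42" * PAGE_SIZE)
with open(path, "rb") as f:
    assert f.read() == b"\x42" * PAGE_SIZE
```

The dangerous window Kevin goes on to describe is a crash between the write and fsync returning: at that point neither the kernel nor the controller has made any atomicity promise about the partially transferred page.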
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com wrote: Kevin Grittner wrote: So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. I'm not concerned about an fsync after the controller has it; I'm concerned about a system crash in the middle of writing an 8K page to the controller. Other than the expected *size* of the window of time during which you're vulnerable, what does a BBU caching controller buy you in this regard? Can't the OS rearrange the writes of disk sectors after the 8K page is written to the OS cache so that the window might occasionally be rather large? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Greg Smith g...@2ndquadrant.com wrote: Kevin Grittner wrote: So you're confident that an 8kB write to the controller will not be done as a series of smaller atomic writes by the OS file system? Sure, that happens. But if the BBU has gotten an fsync call after the 8K write, it shouldn't return success until after all 8K are in its cache. I'm not concerned about an fsync after the controller has it; I'm concerned about a system crash in the middle of writing an 8K page to the controller. Other than the expected *size* of the window of time during which you're vulnerable, what does a BBU caching controller buy you in this regard? Can't the OS rearrange the writes of disk sectors after the 8K page is written to the OS cache so that the window might occasionally be rather large? If the write to the controller fails, the page is not flushed, the fsync never returns, and hence PG does not continue. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [PERFORM] BBU Cache vs. spindles
Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? -Kevin
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? Now that is an interesting question. We write 8k to the kernel, but the kernel doesn't have to honor those write sizes, so while we probably can't get a partial 512-byte block written to disk with a BBU (one that isn't cleaned up by the BBU on reboot), we could get some 512-byte blocks of an 8k write written and others not. I agree you are right and a BBU does not mean you can safely turn off full_page_writes. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
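Bruce's scenario, where some 512-byte sectors of an 8K page reach the platter and others do not, can be sketched with a toy model. The byte values and the crash point below are arbitrary, chosen only to illustrate the arithmetic:

```python
SECTOR = 512
PAGE = 8192
assert PAGE // SECTOR == 16  # one 8K database page spans 16 disk sectors

old = bytes([0xAA]) * PAGE   # page image before the write
new = bytes([0xBB]) * PAGE   # page image PostgreSQL handed to the kernel

def crash_after(sectors_done: int) -> bytes:
    """Model a power loss after only some sectors reached stable storage."""
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]

torn = crash_after(7)  # only 7 of the 16 sectors made it out
assert torn != old and torn != new  # neither image survives intact: a torn page
```

No ordering guarantee from the drive or controller says those 16 sectors land atomically, which is exactly why the 8K write is not atomic even when each 512-byte sector is.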
Re: [PERFORM] BBU Cache vs. spindles
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: I assume we send a full 8k to the controller, and a failure during that write is not registered as a write. On what do you base that assumption? I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? At least on Linux, pages can certainly get written out in 4kb chunks if you're under memory pressure. Andres
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: I assume that we send a full 8K to the OS cache, and the file system writes disk sectors according to its own algorithm. With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? I'd like a 10 minute argument, please. I started to write something to refute this, only to clarify in my head the sequence of events that leads to the most questionable result, where I feel a bit less certain than I did before of the safety here. Here is the worst case I believe you're describing:

1) Transaction is written to the WAL and sync'd; client receives COMMIT. Since full_page_writes is off, the data in the WAL consists only of the delta of what changed on the page.
2) 8K database page is written to the OS cache.
3) PG calls fsync to force the database block out.
4) OS writes the first 4K block of the change to the BBU write cache. Worst case, this fills the cache, and it takes a moment for some random writes to process before it has space to buffer again (that makes this more likely to happen, but it's not required to see the failure case here).
5) Sudden power interruption; the second half of the page write is lost.
6) Server restarts.
7) That 4K write is now replayed from the battery's cache.

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new data. Without a full page write in the WAL, is it always possible to restore its original state now? In theory, I think you do. Since the delta in the WAL should be overwriting all of the bytes that changed between the old and new version of the page, applying it on top of any of four possible states here:

1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

should lead to the same result: an 8K page that includes the change that was in the WAL but not yet on disk at the point when the crash happened.
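Greg's argument, that a delta overwriting every changed byte yields the same final page no matter which of the four torn states it lands on, can be checked with a toy model. It assumes pure byte-overwrite deltas only; the offsets and values are invented for the sketch:

```python
PAGE = 8192
old = bytearray(b"O" * PAGE)

# The WAL "delta" here records every byte that changed and its new value:
# a pure overwrite of bytes 3000..4999, deliberately spanning the 4K boundary.
delta = {i: ord("N") for i in range(3000, 5000)}

new = bytearray(old)
for i, b in delta.items():
    new[i] = b

def apply_delta(page: bytes) -> bytes:
    """Redo by overwriting exactly the changed bytes."""
    out = bytearray(page)
    for i, b in delta.items():
        out[i] = b
    return bytes(out)

states = [
    bytes(old),                       # 1) nothing written yet
    bytes(new[:4096] + old[4096:]),   # 2) first 4K written
    bytes(old[:4096] + new[4096:]),   # 3) second 4K written
    bytes(new),                       # 4) all 8K written
]
# All four torn states converge to the intended page image:
assert {apply_delta(s) for s in states} == {bytes(new)}
```

This only works because the delta touches every byte that differs between old and new; Tom Lane's reply explains why real WAL records are not shaped like this.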
-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [PERFORM] BBU Cache vs. spindles
Greg Smith g...@2ndquadrant.com writes: At this point, you now have a torn 8K page, with 1/2 old and 1/2 new data. Right. Without a full page write in the WAL, is it always possible to restore its original state now? In theory, I think you do. Since the delta in the WAL should be overwriting all of the bytes that changed between the old and new version of the page, applying it on top of any of four possible states here: You've got entirely too simplistic a view of what the delta might be, I fear. In particular there are various sorts of changes that involve inserting the data carried in the WAL record and shifting pre-existing data around to make room, or removing an item and moving remaining data around. If you try to replay that type of action against a torn page, you'll get corrupted results. We could possibly redefine the WAL records so that they weren't just the minimum amount of data but carried every byte that'd changed on the page, and then I think what you're envisioning would work. But the records would be a lot bulkier. It's not clear to me that this would be a net savings over the current design, particularly not if there's a long interval between checkpoints. regards, tom lane
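Tom's objection can be made concrete with a toy page of two-byte items, where the WAL record says "insert an item at the front and shift the rest right" instead of recording the changed bytes. The page size and item contents are invented for the sketch:

```python
PAGE = 8  # a toy page holding four 2-byte items

old = b"AABBCCDD"

def redo_insert_front(page: bytes, item: bytes) -> bytes:
    """Toy WAL redo: insert `item` at the front, shifting existing items right."""
    return (item + page)[:PAGE]

new = redo_insert_front(old, b"XX")   # intended post-write image: XXAABBCC
torn = new[:4] + old[4:]              # crash mid-write: XXAA + CCDD

# The torn page has already lost item BB, and replaying the shift-style
# record against it produces neither the old nor the intended new image:
replayed = redo_insert_front(torn, b"XX")
assert replayed == b"XXXXAACC"
assert replayed != new and replayed != old
```

Because the record describes a *transformation* of the page rather than its final byte values, redo is only correct when it starts from a consistent page image, which is exactly what a full page write in the WAL guarantees.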
Re: [PERFORM] BBU Cache vs. spindles
Kevin Grittner wrote: With either platters or BBU cache, the data is persisted on fsync; why do you see a risk with one but not the other? Forgot to address this part. The troublesome sequence if you don't have a BBU is:

1) WAL data is written to the OS cache.
2) PG calls fsync.
3) Data is transferred into the drive's volatile, non-battery-backed cache.
4) Drive lies about the data being on disk, says fsync is done.
5) That 8K data page is written out to the OS cache, also with fsync, then onto the drive. It says it has that too.
6) Due to the current disk head location, 4KB of the data page gets written out before it gets to the WAL data.
7) System crashes.

Now you're dead. You've just torn a data page, but not written any of the data to the WAL necessary to reconstruct any valid version of that page. I think Kevin's point here may be that if your fsync isn't reliable, you're always in trouble. But if your fsync is good, even torn pages should be repairable by the deltas written to the WAL, as I described in the message I just sent before this one. That's true regardless of whether you achieved non-lying fsync with a BBU or by turning a drive's write cache off. There's nothing really special about the BBU beyond it being the most common form of reliable write cache that works. You get the same properties at a slower rate with a drive that's configured to never lie about writes. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books
Re: [PERFORM] BBU Cache vs. spindles
Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. RAID card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have. Totally agree! -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [PERFORM] BBU Cache vs. spindles
On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote: On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote: Ben Chobot wrote: On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. RAID card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have. Totally agree! BBU first, more spindles second. Agreed. Note that while you can get incredible burst performance from a battery-backed cache, due to both caching and writing out of order, once the throughput begins to saturate at the speed of the disk array, the BBU cache is only re-ordering, really, as it will eventually fill up faster than the disks can take the writes, and you'll settle in at some percentage of the max tps you get for a short benchmark run. It's vitally important that once you put a BBU cache in place, you run a very long-running transactional test (pgbench is a simple one to start with) that floods the I/O subsystem, so you see what your average throughput is with the WAL and data store getting flooded.
I know on my system pgbench runs of a few minutes can be 3 or 4 times faster than runs that last for the better part of an hour.
Re: [PERFORM] BBU Cache vs. spindles
On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote: I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL spindles are nice and all, but they're still spinning media. RAID card cache is way faster, and while it's best at bursty writes, it sounds like bursty writes are precisely what you have.
[PERFORM] BBU Cache vs. spindles
I'm weighing options for a new server. In addition to PostgreSQL, this machine will handle some modest Samba and Rsync load. I will have enough RAM so that virtually all disk-read activity will be cached. The average PostgreSQL read activity will be modest - a mix of single-record and fairly large (reporting) result-sets. Writes will be modest as well but will come in brief (1-5 second) bursts of individual inserts. The rate of insert requests will hit 100-200/second for those brief bursts. So... Am I likely to be better off putting $$$ toward battery-backup on the RAID or toward adding a second RAID-set and splitting off the WAL traffic? Or something else? Cheers, Steve