Re: [PERFORM] BBU Cache vs. spindles

2010-12-22 Thread Bruce Momjian
Bruce Momjian wrote:
 Greg Smith wrote:
  Kevin Grittner wrote:
   I assume that we send a full
   8K to the OS cache, and the file system writes disk sectors
   according to its own algorithm.  With either platters or BBU cache,
   the data is persisted on fsync; why do you see a risk with one but
   not the other?
  
  I'd like a 10 minute argument please.  I started to write something to 
  refute this, only to clarify in my head the sequence of events that 
  leads to the most questionable result, where I feel a bit less certain 
  than I did before of the safety here.  Here is the worst case I believe 
  you're describing:
  
  1) Transaction is written to the WAL and sync'd; client receives 
  COMMIT.  Since full_page_writes is off, the data in the WAL consists 
  only of the delta of what changed on the page.
  2) 8K database page is written to OS cache
  3) PG calls fsync to force the database block out
  4) OS writes first 4K block of the change to the BBU write cache.  Worst 
  case, this fills the cache, and it takes a moment for some random writes 
  to process before it has space to buffer again (makes this more likely 
  to happen, but it's not required to see the failure case here)
  5) Sudden power interruption, second half of the page write is lost
  6) Server restarts
  7) That 4K write is now replayed from the battery's cache
  
  At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
 
 Based on this report, I think we need to update our documentation and
 backpatch removal of text that says that BBU users can safely turn off
 full-page writes.  Patch attached.
 
 I think we have fallen into a trap I remember from the late 1990's where
 I was assuming that an 8k-block based file system would write to the
 disk atomically in 8k segments, which of course it cannot.  My bet is
 that even if you write to the kernel in 8k pages, and have an 8k file
 system, the disk is still accessed via 512-byte blocks, even with a BBU.

Doc patch applied.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have a battery-backed disk
!     controller or file-system software that prevents partial page writes
!     (e.g., ZFS),  you can turn off this page imaging by turning off the
!     <xref linkend="guc-full-page-writes"> parameter.
     </para>
    </sect1>
  
--- 164,175 ----
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have file-system software
!     that prevents partial page writes (e.g., ZFS),  you can turn off
!     this page imaging by turning off the <xref
!     linkend="guc-full-page-writes"> parameter. Battery-Backed Unit
!     (BBU) disk controllers do not prevent partial page writes unless
!     they guarantee that data is written to the BBU as full (8kB) pages.
     </para>
   </sect1>
  

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] BBU Cache vs. spindles

2010-12-01 Thread Pierre C



Is that true?  I have no idea.  I thought everything was done at the
512-byte block level.


Newer disks (2TB and up) can have 4k sectors, but this still means a page  
spans several sectors.
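
To make the arithmetic concrete (trivial sketch, using PostgreSQL's default 8K block size):

```python
PAGE = 8192  # PostgreSQL's default block size

# An 8K page is never a single unit at the drive level: it spans
# multiple hardware sectors on both old and new disks.
for sector in (512, 4096):
    print(f"an 8K page spans {PAGE // sector} sectors of {sector} bytes")
```

So even a 4k-sector drive still writes an 8K page as two separate sectors, which is why the torn-page window never fully closes.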




Re: [PERFORM] BBU Cache vs. spindles

2010-12-01 Thread Bruce Momjian
Pierre C wrote:
 
  Is that true?  I have no idea.  I thought everything was done at the
  512-byte block level.
 
 Newer disks (2TB and up) can have 4k sectors, but this still means a page  
 spans several sectors.

Yes, I had heard about that.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-11-30 Thread Bruce Momjian
Greg Smith wrote:
 Kevin Grittner wrote:
  I assume that we send a full
  8K to the OS cache, and the file system writes disk sectors
  according to its own algorithm.  With either platters or BBU cache,
  the data is persisted on fsync; why do you see a risk with one but
  not the other?
 
 I'd like a 10 minute argument please.  I started to write something to 
 refute this, only to clarify in my head the sequence of events that 
 leads to the most questionable result, where I feel a bit less certain 
 than I did before of the safety here.  Here is the worst case I believe 
 you're describing:
 
 1) Transaction is written to the WAL and sync'd; client receives 
 COMMIT.  Since full_page_writes is off, the data in the WAL consists 
 only of the delta of what changed on the page.
 2) 8K database page is written to OS cache
 3) PG calls fsync to force the database block out
 4) OS writes first 4K block of the change to the BBU write cache.  Worst 
 case, this fills the cache, and it takes a moment for some random writes 
 to process before it has space to buffer again (makes this more likely 
 to happen, but it's not required to see the failure case here)
 5) Sudden power interruption, second half of the page write is lost
 6) Server restarts
 7) That 4K write is now replayed from the battery's cache
 
 At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 

Based on this report, I think we need to update our documentation and
backpatch removal of text that says that BBU users can safely turn off
full-page writes.  Patch attached.

I think we have fallen into a trap I remember from the late 1990's where
I was assuming that an 8k-block based file system would write to the
disk atomically in 8k segments, which of course it cannot.  My bet is
that even if you write to the kernel in 8k pages, and have an 8k file
system, the disk is still accessed via 512-byte blocks, even with a BBU.
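
The worst case Greg outlines above can be simulated in a few lines (a toy Python sketch; MD5 stands in for a page checksum, which PostgreSQL of this era does not actually have on data pages):

```python
import hashlib
import os

PAGE_SIZE = 8192
HALF = 4096  # only this much of the write survives the power loss

def checksum(page: bytes) -> bytes:
    return hashlib.md5(page).digest()

# The old page on "disk" and the new version PostgreSQL wants to write.
old_page = os.urandom(PAGE_SIZE)
new_page = os.urandom(PAGE_SIZE)

disk = bytearray(old_page)

# Step 4: only the first 4K of the 8K write reaches stable storage...
disk[:HALF] = new_page[:HALF]
# Step 5: power interruption -- the second 4K never arrives.

torn = bytes(disk)
# The result matches neither the old nor the new page: a torn page.
assert torn != old_page and torn != new_page
assert checksum(torn) not in (checksum(old_page), checksum(new_page))
print("torn page: first half new, second half old")
```

With full_page_writes on, recovery would overwrite this torn page with the full image from WAL; with it off, only the delta is available, which is the problem discussed below.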
 
-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index a2724fa..1e67bbd 100644
*** /tmp/pgrevert.14281/7sLqTb_wal.sgml	Tue Nov 30 21:57:17 2010
--- doc/src/sgml/wal.sgml	Tue Nov 30 21:56:49 2010
***************
*** 164,173 ****
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have a battery-backed disk
!     controller or file-system software that prevents partial page writes
!     (e.g., ZFS),  you can turn off this page imaging by turning off the
!     <xref linkend="guc-full-page-writes"> parameter.
     </para>
    </sect1>
  
--- 164,175 ----
      <productname>PostgreSQL</> periodically writes full page images to
      permanent WAL storage <emphasis>before</> modifying the actual page on
      disk. By doing this, during crash recovery <productname>PostgreSQL</> can
!     restore partially-written pages.  If you have file-system software
!     that prevents partial page writes (e.g., ZFS),  you can turn off
!     this page imaging by turning off the <xref
!     linkend="guc-full-page-writes"> parameter. Battery-Backed Unit
!     (BBU) disk controllers do not prevent partial page writes unless
!     they guarantee that data is written to the BBU as full (8kB) pages.
     </para>
   </sect1>
  



Re: [PERFORM] BBU Cache vs. spindles

2010-11-30 Thread Bruce Momjian
Kevin Grittner wrote:
 Greg Smith g...@2ndquadrant.com wrote:
  
  I think Kevin's point here may be that if your fsync isn't
  reliable, you're always in trouble.  But if your fsync is good,
  even torn pages should be repairable by the deltas written to the
  WAL
  
 I was actually just arguing that a BBU doesn't eliminate a risk
 here; if there is a risk with production-quality disk drives, there
 is a risk with a controller with a BBU cache.  The BBU cache just
 tends to reduce the window of time in which corruption can occur.  I
 wasn't too sure of *why* there was a risk, but Tom's post cleared
 that up.
  
 I wonder why we need to expose this GUC at all -- perhaps it should
 be off when fsync is off and on otherwise?  Leaving it on without
 fsync is just harming performance for not much benefit, and turning
 it off with fsync seems to be saying that you are willing to
 tolerate a known risk of database corruption, just not quite so much
 as you have without fsync.  In reality it seems most likely to be a
 mistake, either way.

According to our docs, and my submitted patch, if you are using ZFS then
you can turn off full-page writes, so full-page writes are still useful.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-11-30 Thread Bruce Momjian
Greg Smith wrote:
 Tom Lane wrote:
  You've got entirely too simplistic a view of what the delta might be,
  I fear.  In particular there are various sorts of changes that involve
  inserting the data carried in the WAL record and shifting pre-existing
  data around to make room, or removing an item and moving remaining data
  around.  If you try to replay that type of action against a torn page,
  you'll get corrupted results.

 
 I wasn't sure exactly how those were encoded, thanks for the 
 clarification.  Given that, it seems to me there are only two situations 
 where full_page_writes is safe to turn off:
 
 1) The operating system block size is exactly the same database block 
 size, and all writes are guaranteed to be atomic to that block size. 

Is that true?  I have no idea.  I thought everything was done at the
512-byte block level.
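
Tom's warning about shift-type deltas can be made concrete with a toy page of fixed-width items (purely illustrative; real WAL records and page layouts are far more complex):

```python
# Toy illustration: a "page" holding fixed-width items, with a WAL record
# that says "insert item at slot 1, shifting later items right".
def replay_insert(page: list, slot: int, item: str) -> list:
    # Last slot falls off the end, for simplicity of the toy model.
    return page[:slot] + [item] + page[slot:-1]

clean = ["A", "B", "C", "_"]
redone = replay_insert(clean, 1, "X")   # correct result on an intact page

# A torn page: the first half already reflects the insert, the second
# half is still the old contents ("C" is lost, "B" appears duplicated).
torn = ["A", "X", "B", "_"]
retorn = replay_insert(torn, 1, "X")    # shifts already-shifted data again

assert retorn != redone                 # replaying the delta cannot repair it
```

The point is that a shift-type record is only meaningful against the exact pre-image it was generated from; applied to a half-old, half-new page it compounds the damage rather than fixing it.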

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread Aidan Van Dyk
On Fri, Oct 29, 2010 at 11:43 AM, Robert Haas robertmh...@gmail.com wrote:

 Well, we COULD keep the data in shared buffers, and then copy it into
 an mmap()'d region rather than calling write(), but I'm not sure
 there's any advantage to it.  Managing address space mappings is a
 pain in the butt.

I could see this being a *theoretical* benefit in the case that the
background writer gains the ability to write out all blocks associated
with a file in order.  In that case, you might get a win because you
could get a single mmap of the entire file, and just wholesale memcpy
blocks across, then sync/unmap it.

This, of course, assumes a few things that must hold for it to be performant:
0) a list of blocks to be written grouped by files is readily available.
1) The pages you write to must be in the page cache, or your memcpy is
going to fault them in.  With a plain write, you don't need the
over-written page in the cache.
2) Now, instead of the torn-page problem being FS block/sector-size
based, you can actually have an arbitrary amount of the block's memory
written when the kernel writes out the page.  You *really* need
full-page writes.
3) The mmap overhead required for the kernel to setup the mappings is
less than the repeated syscalls of a simple write().

All those things seem like something that somebody could synthetically
benchmark to prove value before even trying to bolt into PostgreSQL.
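
Such a synthetic benchmark could start as small as this sketch (Python for brevity; file names are made up, and a real test would need O_DIRECT-style alignment, many iterations, and cache-state control):

```python
import mmap
import os
import time

SIZE = 8192 * 1024          # 8 MB "relation file": 1024 pages of 8K
buf = os.urandom(SIZE)

# Variant 1: plain write() of the buffer, then fsync.
fd = os.open("write_test.tmp", os.O_CREAT | os.O_RDWR | os.O_TRUNC)
t0 = time.perf_counter()
os.write(fd, buf)
os.fsync(fd)
t_write = time.perf_counter() - t0
os.close(fd)

# Variant 2: memcpy into an mmap'd region, then msync.
fd = os.open("mmap_test.tmp", os.O_CREAT | os.O_RDWR | os.O_TRUNC)
os.ftruncate(fd, SIZE)
m = mmap.mmap(fd, SIZE)
t0 = time.perf_counter()
m[:] = buf                  # the memcpy may fault pages in: Aidan's point 1
m.flush()                   # msync
t_mmap = time.perf_counter() - t0
m.close()
os.close(fd)

print(f"write+fsync: {t_write:.4f}s  mmap+msync: {t_mmap:.4f}s")
os.unlink("write_test.tmp")
os.unlink("mmap_test.tmp")
```

The numbers from a toy like this prove nothing by themselves, but they would show whether point 3 (mmap setup cost vs. syscall cost) is even in the right ballpark before anyone writes PostgreSQL code.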

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.



Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It's true that we don't know whether write() causes an immediate or
 delayed disk write, but we generally don't care that much.  What we do
 care about is being able to ensure that a WAL write happens before the
 data write, and with mmap we don't have control over that.

 Well, we COULD keep the data in shared buffers, and then copy it into
 an mmap()'d region rather than calling write(), but I'm not sure
 there's any advantage to it.  Managing address space mappings is a
 pain in the butt.

In principle that ought to be right about the same speed as using
write() to copy the data from shared buffers to kernel disk buffers,
anyway.

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread Robert Haas
On Fri, Oct 29, 2010 at 11:56 AM, Aidan Van Dyk ai...@highrise.ca wrote:
 1) The pages you write to must be in the page cache, or your memcpy is
 going to fault them in.  With a plain write, you don't need the
 over-written page in the cache.

I seem to remember a time many years ago when I got bitten by this
problem.  The fact that our I/O is in 8K pages means this could be a
pretty severe hit, I think.

 2) Now, instead of the torn-page problem being FS block/sector-size
 based, you can actually have an arbitrary amount of the block's memory
 written when the kernel writes out the page.  You *really* need
 full-page writes.

Yeah.

 3) The mmap overhead required for the kernel to setup the mappings is
 less than the repeated syscalls of a simple write().

You'd expect to save something from that; but on the other hand, at
least on 32-bit systems, there's a very limited number of 1GB files
that can be simultaneously mapped into one address space, and it's a
lot smaller than the number of file descriptors that you can have
open.   Rumor has it that cutting down the number of fds that can stay
open simultaneously is pretty bad for performance, so cutting it down
to a number you can count on one hand (maybe one finger) would
probably be bad.  Maybe on 64-bit it would be OK but it seems like an
awful lot of complexity for at most a minor savings (and a pretty bad
anti-savings if point #1 kicks in).

Anyway this is all totally off-topic...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread Alvaro Herrera
Excerpts from Greg Smith's message of Thu Oct 21 14:04:17 -0300 2010:

 What I would like to do is beef up the documentation with some concrete 
 examples of how to figure out if your cache and associated write path 
 are working reliably or not.  It should be possible to include "does 
 this handle full page writes correctly?" in that test suite.  Until we 
 have something like that, I'm concerned that bugs in filesystem or 
 controller handling may make full_page_writes unsafe even with a BBU, 
 and we'd have no way for people to tell if that's true or not.

I think if you assume that there are bugs in the filesystem which you
need to protect against, you are already hosed.  I imagine there must be
some filesystem bug that makes it safe to have full_page_writes=on, but
unsafe to have full_page_writes=off; but I'd probably discard those as a
rare minority and thus not worth worrying about.

I agree it would be worth testing though.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread david

On Fri, 29 Oct 2010, Robert Haas wrote:


On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:

James Mansion ja...@mansionfamily.plus.com writes:

Tom Lane wrote:

The other and probably worse problem is that there's no application
control over how soon changes to mmap'd pages get to disk.  An msync
will flush them out, but the kernel is free to write dirty pages sooner.
So if they're depending for consistency on writes not happening until
msync, it's broken by design.  (This is one of the big reasons we don't
use mmap'd space for Postgres disk buffers.)



Well, I agree that it sucks for the reason you give - but you use
write and that's *exactly* the same in terms of when it gets written,
as when you update a byte on an mmap'd page.


Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we
say so.  If the buffer were mmap'd then we'd have no control over that,
which makes it mighty hard to obey the WAL write log before data
paradigm.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do
care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.


Well, we COULD keep the data in shared buffers, and then copy it into
an mmap()'d region rather than calling write(), but I'm not sure
there's any advantage to it.  Managing address space mappings is a
pain in the butt.


keep in mind that you have no way of knowing what order the data in the 
mmap region gets written out to disk.


David Lang


Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread James Mansion

Tom Lane wrote:

Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we
  
Well, I don't know where you got the idea I was referring to that sort of 
thing - it's the same as writing to a buffer before copying to the mmap'd 
area.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do
  

Which is what I was referring to.

care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.

  
I think you have just the same control either way, because you can only 
force ordering with an appropriate explicit sync, and in the absence of 
such a sync all bets are off for whether/when each disk page is written 
out; and if you can't ensure that the controller and disk are 
write-through, you'd better do a hardware cache flush too, right?


A shame that so many systems have relatively poor handling of that 
hardware flush.





Re: [PERFORM] BBU Cache vs. spindles

2010-10-29 Thread david

On Fri, 29 Oct 2010, James Mansion wrote:


Tom Lane wrote:

Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we

Well, I don't know where you got the idea I was referring to that sort of 
thing - it's the same as writing to a buffer before copying to the mmap'd 
area.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do


Which is what I was referring to.

care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.


I think you have just the same control either way, because you can only 
force ordering with an appropriate explicit sync, and in the absence of 
such a sync all bets are off for whether/when each disk page is written 
out; and if you can't ensure that the controller and disk are 
write-through, you'd better do a hardware cache flush too, right?

A shame that so many systems have relatively poor handling of that hardware 
flush.


the issue is that when you update a mmap'd chunk of data, it could be 
written out immediately without you doing _anything_ (and thanks to 
multiple cores/threads, it could get written out while you are still in 
the middle of updating it). When you update an internal buffer and then 
write that, you know that nothing will hit the disk before you issue the 
write command.


David Lang



Re: [PERFORM] BBU Cache vs. spindles

2010-10-28 Thread James Mansion

Tom Lane wrote:

The other and probably worse problem is that there's no application
control over how soon changes to mmap'd pages get to disk.  An msync
will flush them out, but the kernel is free to write dirty pages sooner.
So if they're depending for consistency on writes not happening until
msync, it's broken by design.  (This is one of the big reasons we don't
use mmap'd space for Postgres disk buffers.)
  
Well, I agree that it sucks for the reason you give - but you use write 
and that's *exactly* the
same in terms of when it gets written, as when you update a byte on an 
mmap'd page.


And you're quite happy to use write.

The only difference is that it's a lot more explicit where the point of 
'maybe it's written and maybe it isn't' occurs.

There need be no real difference in the architecture for one over the 
other: there does seem to be
evidence that write and read can have better forward-read and 
write-behind behaviour, because
read/write does allow you to initiate an IO with a hint to a size that 
exceeds a hardware page.


And yes, after getting into the details while starting to port TC to 
Windows, I decided to bin it.  Especially handy that SQLite3 has WAL 
now.  (And one last dig - TC didn't even have a checksum that would let 
you tell when it had been broken: but it might all be fixed now, of 
course; I don't have time to check.)

James




Re: [PERFORM] BBU Cache vs. spindles

2010-10-28 Thread Tom Lane
James Mansion ja...@mansionfamily.plus.com writes:
 Tom Lane wrote:
 The other and probably worse problem is that there's no application
 control over how soon changes to mmap'd pages get to disk.  An msync
 will flush them out, but the kernel is free to write dirty pages sooner.
 So if they're depending for consistency on writes not happening until
 msync, it's broken by design.  (This is one of the big reasons we don't
 use mmap'd space for Postgres disk buffers.)

 Well, I agree that it sucks for the reason you give - but you use
 write and that's *exactly* the same in terms of when it gets written,
 as when you update a byte on an mmap'd page.

Uh, no, it is not.  The difference is that we can update a byte in a
shared buffer, and know that it *isn't* getting written out before we
say so.  If the buffer were mmap'd then we'd have no control over that,
which makes it mighty hard to obey the WAL write log before data
paradigm.

It's true that we don't know whether write() causes an immediate or
delayed disk write, but we generally don't care that much.  What we do
care about is being able to ensure that a WAL write happens before the
data write, and with mmap we don't have control over that.
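
The WAL-before-data ordering Tom describes can be sketched as follows (a toy illustration with made-up file names and record format, not the actual xlog code):

```python
import os

def commit(wal_fd: int, data_fd: int, page_offset: int,
           wal_record: bytes, page: bytes) -> None:
    os.write(wal_fd, wal_record)
    os.fsync(wal_fd)                   # WAL is durable *before* the data write
    os.pwrite(data_fd, page, page_offset)
    # The data file is fsync'd later, at checkpoint time -- the ordering
    # guarantee, not the timing of the data write, is what matters.

wal = os.open("toy.wal", os.O_CREAT | os.O_WRONLY | os.O_APPEND)
data = os.open("toy.dat", os.O_CREAT | os.O_RDWR)
commit(wal, data, 0, b"xlog: update page 0\n", b"\x00" * 8192)
os.close(wal)
os.close(data)
```

With explicit write() calls this ordering is easy to enforce; with an mmap'd buffer the kernel may flush the dirty data page at any point before the fsync of the WAL, which is exactly the problem.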

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-27 Thread Robert Haas
On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch wult...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote:
 The double write buffer is one of the few areas where InnoDB does more
 IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
 checkpoints (which help to keep dirty pages in memory longer),
 buffering of writing out changes to secondary indexes, and recently
 tunable page level compression.

 Baron Schwartz was talking to me about this at Surge.  I don't really
 understand how the fuzzy checkpoint stuff works, and I haven't been
 able to find a good description of it anywhere.  How does it keep
 dirty pages in memory longer?  Details on the other things you mention
 would be interesting to hear, too.

 For checkpoint behavior:
 http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false

 I would think that best case behavior sharp checkpoints with a large
 checkpoint_completion_target would have behavior similar to a fuzzy
 checkpoint.

Well, under that definition of a fuzzy checkpoint, our checkpoints are
fuzzy even with checkpoint_completion_target=0.

What Baron seemed to be describing was a scheme whereby you could do
what I might call partial checkpoints.  IOW, you want to move the redo
pointer without writing out ALL the dirty buffers in memory, so you
write out the pages with the oldest LSNs and then move the redo
pointer to the oldest LSN you have left.  Except that doesn't quite
work, because the page might have been dirtied at LSN X and then later
updated again at LSN Y, and you still have to flush it to disk before
moving the redo pointer to any value > X.  So you work around that by
maintaining a first dirtied LSN for each page as well as the current
LSN.

I'm not 100% sure that this is how it works or that it would work in
PG, but even assuming that it is and does, I'm not sure what the
benefit is over the checkpoint-spreading logic we have now.  There
might be some benefit in sorting the writes that we do, so that we can
spread out the fsyncs.  So, write all the blocks to a given file,
fsync, and then repeat for each underlying data file that has at least
one dirty block.  But that's completely orthogonal to (and would
actually be hindered by) the approach described in the preceding
paragraph.
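
The bookkeeping described above can be sketched in a few lines (hypothetical page IDs and LSNs, purely illustrative):

```python
# Each dirty page tracks both the LSN at which it was *first* dirtied
# and its latest LSN; the redo pointer may only advance to the minimum
# first-dirtied LSN among pages still unflushed.
dirty = {
    # page_id: (first_dirtied_lsn, latest_lsn)
    1: (100, 250),
    2: (120, 130),
    3: (200, 210),
}

def partial_checkpoint(dirty: dict, n_flush: int) -> int:
    """Flush the n pages with the oldest first-dirtied LSN; return the
    new redo pointer."""
    for page in sorted(dirty, key=lambda p: dirty[p][0])[:n_flush]:
        del dirty[page]              # "write out" the page
    # Redo must start no later than the oldest change still only in memory.
    return min(first for first, _ in dirty.values())

redo_ptr = partial_checkpoint(dirty, 1)   # flushes page 1 (first dirtied at 100)
```

Note that page 1 must be flushed even though its *latest* LSN (250) is the newest of the three; sorting by latest LSN alone would get this wrong, which is why the first-dirtied LSN has to be tracked separately.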

 Insert buffering (for InnoDB 1.1+ there is evidently also delete and
 purge buffering):
 http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

We do something a bit like this for GIST indices.  It would be
interesting to see if it also has a benefit for btree indices.

 For a recent ~800GB db I had to restore, the insert buffer saved 92%
 of io needed for secondary indexes.

 Compression:
 http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html

 For many workloads 50% compression results in negligible impact to
 performance. For certain workloads compression can help performance.
 Please note that InnoDB also has a non-tunable TOAST-like feature.

Interesting.  I am surprised this works well.  It seems that this only
works for pages that can be compressed by >=50%, which seems like it
could result in a lot of CPU wasted on failed attempts to compress.
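
The concern can be illustrated with a toy version of the check (assuming zlib and InnoDB's 16kB default page size; the real KEY_BLOCK_SIZE logic is more involved):

```python
import os
import zlib

PAGE = 16384          # InnoDB's default page size (assumed for illustration)

def fits_compressed(page: bytes, target: int = PAGE // 2) -> bool:
    """InnoDB-style check: keep the page compressed only if it shrinks
    to half size or less; otherwise the compression CPU was wasted."""
    return len(zlib.compress(page)) <= target

compressible = b"A" * PAGE            # highly redundant -> compresses well
incompressible = os.urandom(PAGE)     # random bytes -> will not compress

assert fits_compressed(compressible)
assert not fits_compressed(incompressible)
```

For the incompressible page, the full compression pass still runs before the attempt is abandoned, which is exactly the wasted-CPU case; Rob's reply below notes that in practice the spinning disk, not the CPU, is usually the bottleneck.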

 Given that InnoDB is not shipping its logs across the wire, I don't
 think many users would really care if it used the double writer or
 full page writes approach to the redo log (other than the fact that
 the log files would be bigger). PG on the other hand *is* pushing its
 logs over the wire...

 So how is InnoDB doing replication?  Is there a second log just for that?


 The other log is the binary log and it is one of the biggest
 problems with MySQL. Running MySQL in such a way that the binary log
 stays in sync with the InnoDB redo has a very significant impact on
 performance.
 http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
 http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
 (check out the pretty graph)

Hmm.  That seems kinda painful.  Having to ship full page images over
the wire doesn't seems so bad by comparison, though I'm not very happy
about having to do that either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] BBU Cache vs. spindles

2010-10-27 Thread Rob Wultsch
On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch wult...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote:
 The double write buffer is one of the few areas where InnoDB does more
 IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
 checkpoints (which help to keep dirty pages in memory longer),
 buffering of writing out changes to secondary indexes, and recently
 tunable page level compression.

 Baron Schwartz was talking to me about this at Surge.  I don't really
 understand how the fuzzy checkpoint stuff works, and I haven't been
 able to find a good description of it anywhere.  How does it keep
 dirty pages in memory longer?  Details on the other things you mention
 would be interesting to hear, too.

 For checkpoint behavior:
 http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false

 I would think that best case behavior sharp checkpoints with a large
 checkpoint_completion_target would have behavior similar to a fuzzy
 checkpoint.

 Well, under that definition of a fuzzy checkpoint, our checkpoints are
 fuzzy even with checkpoint_completion_target=0.

 What Baron seemed to be describing was a scheme whereby you could do
 what I might call partial checkpoints.  IOW, you want to move the redo
 pointer without writing out ALL the dirty buffers in memory, so you
 write out the pages with the oldest LSNs and then move the redo
 pointer to the oldest LSN you have left.  Except that doesn't quite
 work, because the page might have been dirtied at LSN X and then later
 updated again at LSN Y, and you still have to flush it to disk before
 moving the redo pointer to any value X.  So you work around that by
 maintaining a first dirtied LSN for each page as well as the current
 LSN.

 I'm not 100% sure that this is how it works or that it would work in
 PG, but even assuming that it is and does, I'm not sure what the
 benefit is over the checkpoint-spreading logic we have now.  There
 might be some benefit in sorting the writes that we do, so that we can
 spread out the fsyncs.  So, write all the blocks to a give file,
 fsync, and then repeat for each underlying data file that has at least
 one dirty block.  But that's completely orthogonal to (and would
 actually be hindered by) the approach described in the preceding
 paragraph.

I wish I could answer your questions better. I am a power user that
does not fully understand InnoDB internals. There are not all that
many folks that have a very good understanding of InnoDB internals
(given how well it works there is not all that much need).


 Insert buffering (for InnoDB 1.1+, there is evidently also delete and purge
 buffering):
 http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

 We do something a bit like this for GIST indices.  It would be
 interesting to see if it also has a benefit for btree indices.

 For a recent ~800GB db I had to restore, the insert buffer saved 92%
 of io needed for secondary indexes.

 Compression:
 http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html

 For many workloads 50% compression results in negligible impact to
 performance. For certain workloads compression can help performance.
 Please note that InnoDB also has a non-tunable TOAST-like feature.

 Interesting.  I am surprised this works well.  It seems that this only
 works for pages that can be compressed by at least 50%, which seems like it
 could result in a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me.


 Given that InnoDB is not shipping its logs across the wire, I don't
 think many users would really care if it used the double writer or
 full page writes approach to the redo log (other than the fact that
 the log files would be bigger). PG on the other hand *is* pushing its
 logs over the wire...

 So how is InnoDB doing replication?  Is there a second log just for that?


 The other log is the binary log and it is one of the biggest
 problems with MySQL. Running MySQL in such a way that the binary log
 stays in sync with the InnoDB redo has a very significant impact on
 performance.
 http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
 http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
 (check out the pretty graph)

 Hmm.  That seems kinda painful.  Having to ship full page images over
 the wire doesn't seems so bad by comparison, though I'm not very happy
 about having to do that either.


The binary log is less than ideal, but with MySQL replication I can
replicate 

Re: [PERFORM] BBU Cache vs. spindles

2010-10-26 Thread Robert Haas
On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Rob Wultsch wult...@gmail.com wrote:

 I would think full_page_writes=off + double write buffer should be
 far superior, particularly given that the WAL is shipped over the
 network to slaves.

 For a reasonably brief description of InnoDB double write buffers, I
 found this:

 http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

 One big question before even considering this would be how to
 determine whether a potentially torn page is inconsistent.
 Without a page CRC or some such mechanism, I don't see how this
 technique is possible.

There are two sides to this problem: figuring out when to write a page
to the double write buffer, and figuring out when to read it back from
the double write buffer.  The first seems easy: we just do it whenever
we would XLOG a full page image.  As to the second, when we write the
page out to the double write buffer, we could also write to the double
write buffer the LSN of the WAL record which depends on that full page
image.  Then, at the start of recovery, we scan the double write
buffer and remember all those LSNs.  When we reach one of them, we
replay the full page image.

The good thing about this is that it would reduce WAL volume; the bad
thing about it is that it would probably mean doing two fsyncs where
we only now do one.
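A toy sketch of that recovery bookkeeping, in Python. Every name here is invented, and the double write buffer and WAL are simple in-memory stand-ins; this only illustrates the logic described above, not PostgreSQL internals.

```python
# Toy sketch of the double-write-buffer recovery scheme described above.
# All names are hypothetical; in-memory lists stand in for the separately
# fsync'd double-write file and the WAL.

class DoubleWriteBuffer:
    """Holds (lsn, page_no, image) records written before the in-place write."""
    def __init__(self):
        self.records = []

    def log_full_page(self, lsn, page_no, image):
        # Called whenever we would otherwise have XLOG'd a full page image.
        self.records.append((lsn, page_no, bytes(image)))

def recover(dwb, storage, wal):
    """Replay WAL; when we reach an LSN that has a buffered full-page
    image, restore that image first (the on-disk copy may be torn)."""
    full_pages = {lsn: (page_no, image) for lsn, page_no, image in dwb.records}
    for lsn, page_no, delta in wal:
        if lsn in full_pages:
            p, image = full_pages[lsn]
            storage[p] = bytearray(image)      # overwrite possibly-torn page
        storage[page_no][:len(delta)] = delta  # then apply the WAL delta

# Simulation: page 0 was torn by a crash mid-write.
storage = {0: bytearray(b"TORNPAGE")}
dwb = DoubleWriteBuffer()
dwb.log_full_page(100, 0, b"GOODPAGE")
wal = [(100, 0, b"new!")]
recover(dwb, storage, wal)
print(bytes(storage[0]))  # b'new!PAGE'
```

On recovery the full image overwrites whatever possibly-torn state the page was left in, and only then is the delta applied, which is what makes replaying deltas against it safe again.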

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] BBU Cache vs. spindles

2010-10-26 Thread Rob Wultsch
On Tue, Oct 26, 2010 at 5:41 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Oct 22, 2010 at 3:05 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Rob Wultsch wult...@gmail.com wrote:

 I would think full_page_writes=off + double write buffer should be
 far superior, particularly given that the WAL is shipped over the
 network to slaves.

 For a reasonably brief description of InnoDB double write buffers, I
 found this:

 http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

 One big question before even considering this would be how to
 determine whether a potentially torn page is inconsistent.
 Without a page CRC or some such mechanism, I don't see how this
 technique is possible.

 There are two sides to this problem: figuring out when to write a page
 to the double write buffer, and figuring out when to read it back from
 the double write buffer.  The first seems easy: we just do it whenever
 we would XLOG a full page image.  As to the second, when we write the
 page out to the double write buffer, we could also write to the double
 write buffer the LSN of the WAL record which depends on that full page
 image.  Then, at the start of recovery, we scan the double write
 buffer and remember all those LSNs.  When we reach one of them, we
 replay the full page image.

 The good thing about this is that it would reduce WAL volume; the bad
 thing about it is that it would probably mean doing two fsyncs where
 we only now do one.


The double write buffer is one of the few areas where InnoDB does more
IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
checkpoints (which help to keep dirty pages in memory longer),
buffering of writing out changes to secondary indexes, and recently
tunable page level compression.

Given that InnoDB is not shipping its logs across the wire, I don't
think many users would really care if it used the double writer or
full page writes approach to the redo log (other than the fact that
the log files would be bigger). PG on the other hand *is* pushing its
logs over the wire...

-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-26 Thread Robert Haas
On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote:
 The double write buffer is one of the few areas where InnoDB does more
 IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
 checkpoints (which help to keep dirty pages in memory longer),
 buffering of writing out changes to secondary indexes, and recently
 tunable page level compression.

Baron Schwartz was talking to me about this at Surge.  I don't really
understand how the fuzzy checkpoint stuff works, and I haven't been
able to find a good description of it anywhere.  How does it keep
dirty pages in memory longer?  Details on the other things you mention
would be interesting to hear, too.

 Given that InnoDB is not shipping its logs across the wire, I don't
 think many users would really care if it used the double writer or
 full page writes approach to the redo log (other than the fact that
 the log files would be bigger). PG on the other hand *is* pushing its
 logs over the wire...

So how is InnoDB doing replication?  Is there a second log just for that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] BBU Cache vs. spindles

2010-10-26 Thread Rob Wultsch
On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch wult...@gmail.com wrote:
 The double write buffer is one of the few areas where InnoDB does more
 IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
 checkpoints (which help to keep dirty pages in memory longer),
 buffering of writing out changes to secondary indexes, and recently
 tunable page level compression.

 Baron Schwartz was talking to me about this at Surge.  I don't really
 understand how the fuzzy checkpoint stuff works, and I haven't been
 able to find a good description of it anywhere.  How does it keep
 dirty pages in memory longer?  Details on the other things you mention
 would be interesting to hear, too.

For checkpoint behavior:
http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false

I would think that, in the best case, sharp checkpoints with a large
checkpoint_completion_target would behave similarly to a fuzzy
checkpoint.

Insert buffering (for InnoDB 1.1+, there is evidently also delete and purge
buffering):
http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html

For a recent ~800GB db I had to restore, the insert buffer saved 92%
of io needed for secondary indexes.

Compression:
http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html

For many workloads 50% compression results in negligible impact to
performance. For certain workloads compression can help performance.
Please note that InnoDB also has a non-tunable TOAST-like feature.


 Given that InnoDB is not shipping its logs across the wire, I don't
 think many users would really care if it used the double writer or
 full page writes approach to the redo log (other than the fact that
 the log files would be bigger). PG on the other hand *is* pushing its
 logs over the wire...

 So how is InnoDB doing replication?  Is there a second log just for that?


The other log is the binary log and it is one of the biggest
problems with MySQL. Running MySQL in such a way that the binary log
stays in sync with the InnoDB redo has a very significant impact on
performance.
http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
(check out the pretty graph)

If you are going to West you should considering heading over to the
Facebook office on Tuesday as the MySQL team is having something of an
open house:
http://www.facebook.com/event.php?eid=160712450628622

Mark Callaghan from the Facebook MySQL engineering team (and several
members of their ops team, for that matter) understands InnoDB
dramatically better than I do.

-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-24 Thread James Mansion

Kevin Grittner wrote:

On what do you base that assumption?  I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other?
  
Surely 'the data is persisted sometime after our write and before the
fsync returns, but may be written:
- in small chunks
- out of order
- in an unpredictable way'

When I looked at the internals of TokyoCabinet for example, the design
was flawed but would be 'fairly robust' so long as mmap'd pages that were
dirtied did not get persisted until msync, and were then persisted atomically.




Re: [PERFORM] BBU Cache vs. spindles

2010-10-24 Thread Greg Smith

James Mansion wrote:
When I looked at the internals of TokyoCabinet for example, the design
was flawed but would be 'fairly robust' so long as mmap'd pages that were
dirtied did not get persisted until msync, and were then persisted atomically.


If TokyoCabinet presumes that's true and overwrites existing blocks with 
that assumption, it would land onto my list of databases I wouldn't 
trust to hold my TODO list.  Flip off power to a server, and you have no 
idea what portion of the blocks sitting in the drive's cache actually 
made it to disk; that's not even guaranteed atomic to the byte level.  
Torn pages happen all the time unless you either a) put the entire write 
into a non-volatile cache before writing any of it, b) write and sync 
somewhere else first and then do a journaled filesystem pointer swap 
from the old page to the new one, or c) journal the whole write the way 
PostgreSQL does with full_page_writes and the WAL.  The discussion here 
veered off over whether (a) was sufficiently satisfied just by having a 
RAID controller with battery backup, and what I concluded from the dive 
into the details is that it's definitely not true unless the filesystem 
block size exactly matches the database one.  And even then, make sure 
you test heavily.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-24 Thread Greg Smith

Jesper Krogh wrote:
Can you point to some ZFS docs that tell that this is the case? I'd be
surprised if it doesn't copy away the old block and replace it with the
new one in-place. The other behaviour would quite quickly lead to a hugely
fragmented filesystem that performs next to useless, and ZFS doesn't seem
to be in that category.


http://all-unix.blogspot.com/2007/03/zfs-cow-and-relate-features.html

Blocks containing active data are never overwritten in place; instead, 
a new block is allocated, modified data is written to it, and then any 
metadata blocks referencing it are similarly read, reallocated, and 
written.


http://opensolaris.org/jive/thread.jspa?messageID=19264 discusses how
this interacts with the common types of hardware around: no guarantees
with lying hard drives, as always, but otherwise you're fine.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-24 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 James Mansion wrote:
 When I looked at the internals of TokyoCabinet for example, the design 
 was flawed but
 would be 'fairly robust' so long as mmap'd pages that were dirtied did 
 not get persisted
 until msync, and were then persisted atomically.

 If TokyoCabinet presumes that's true and overwrites existing blocks with 
 that assumption, it would land onto my list of databases I wouldn't 
 trust to hold my TODO list.  Flip off power to a server, and you have no 
 idea what portion of the blocks sitting in the drive's cache actually 
 made it to disk; that's not even guaranteed atomic to the byte level.  

The other and probably worse problem is that there's no application
control over how soon changes to mmap'd pages get to disk.  An msync
will flush them out, but the kernel is free to write dirty pages sooner.
So if they're depending for consistency on writes not happening until
msync, it's broken by design.  (This is one of the big reasons we don't
use mmap'd space for Postgres disk buffers.)

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-23 Thread Kevin Grittner
Rob Wultsch  wrote:
 
 I really would like to work with PG more and this seems like
 [full_page_writes] would be a significant hindrance for certain
 usage patterns. Lots of replication does not take place over gig...
 
Certainly most of the Wisconsin State Courts replication takes place
over WAN connections at a few Mbps.  I haven't seen any evidence that
having full_page_writes on has caused us problems, personally.
 
In the PostgreSQL community you generally need to show some hard
numbers from a repeatable test case for the community to believe that
there's a problem which needs fixing, much less to buy in to some
particular fix for the purported problem.  On the other hand, if you
can show that there actually *is* a problem, I've never seen a group
which responds so quickly and effectively to solve it as the
PostgreSQL community.  Don't get too attached to a particular
solution without proof that it's better than the alternatives,
though.
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-23 Thread Josh Berkus



Even if it's possible, it's far from clear to me that it would be an
improvement.  The author estimates (apparently somewhat loosely)
that it's a 5% to 10% performance hit in InnoDB; I'm far from
certain that full_page_writes cost us that much.  Does anyone have
benchmark numbers handy?


It most certainly can, depending on your CPU saturation and I/O support.
I've seen a 10% improvement in throughput from turning off
full_page_writes on some machines, such as when we were doing the
SPECjAppServer benchmarks on Solaris.


--
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 I think Kevin's point here may be that if your fsync isn't
 reliable, you're always in trouble.  But if your fsync is good,
 even torn pages should be repairable by the deltas written to the
 WAL
 
I was actually just arguing that a BBU doesn't eliminate a risk
here; if there is a risk with production-quality disk drives, there
is a risk with a controller with a BBU cache.  The BBU cache just
tends to reduce the window of time in which corruption can occur.  I
wasn't too sure of *why* there was a risk, but Tom's post cleared
that up.
 
I wonder why we need to expose this GUC at all -- perhaps it should
be off when fsync is off and on otherwise?  Leaving it on without
fsync is just harming performance for not much benefit, and turning
it off with fsync seems to be saying that you are willing to
tolerate a known risk of database corruption, just not quite so much
as you have without fsync.  In reality it seems most likely to be a
mistake, either way.
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Greg Smith

Tom Lane wrote:

You've got entirely too simplistic a view of what the delta might be,
I fear.  In particular there are various sorts of changes that involve
inserting the data carried in the WAL record and shifting pre-existing
data around to make room, or removing an item and moving remaining data
around.  If you try to replay that type of action against a torn page,
you'll get corrupted results.
  


I wasn't sure exactly how those were encoded, thanks for the 
clarification.  Given that, it seems to me there are only two situations 
where full_page_writes is safe to turn off:


1) The operating system block size is exactly the same as the database block
size, and all writes are guaranteed to be atomic to that block size.

2) You're using a form of journaled filesystem where data blocks are 
never updated, they're always written elsewhere and the filesystem is 
redirected to that new block once it's on disk.
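Category (2) can be sketched as a toy copy-on-write store. The structures below are invented for illustration (this is not ZFS code); the point is that a crash before the final pointer swap simply leaves the old version live, so a torn page is never visible.

```python
# Minimal sketch of option (2): never overwrite a live block; write the
# new version elsewhere, then atomically swap the pointer to it.
# Structures are hypothetical, mirroring what copy-on-write filesystems
# like ZFS do conceptually.

class CowStore:
    def __init__(self):
        self.blocks = {}      # physical block id -> data
        self.pointer = {}     # logical page -> physical block id
        self.next_id = 0

    def write_page(self, page_no, data):
        # 1) write the new data to a fresh physical block
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = data
        # 2) ...an fsync of the new block would go here...
        # 3) atomically redirect the logical page; a crash before this
        #    step leaves the old, intact version still reachable
        self.pointer[page_no] = new_id

    def read_page(self, page_no):
        return self.blocks[self.pointer[page_no]]

s = CowStore()
s.write_page(1, b"v1")
s.write_page(1, b"v2")
print(s.read_page(1))  # b'v2'
```

The old block (here `blocks[0]`) is still present until garbage-collected, which is exactly the fragmentation trade-off Jesper raises later in the thread.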


Looks to me like whether or not there's a non-volatile write cache 
sitting in the middle, like a BBU protected RAID card, doesn't really 
make any difference here then.


I think that most people who have thought they were safe to turn off 
full_page_writes in the past did so because they believed they were in 
category (1) here.  I've never advised anyone to do that, because it's 
so difficult to validate the truth of.  Just given that, I'd be tempted 
to join in on suggesting this parameter just go away in the name of 
safety, except that I think category (2) here is growing now.  ZFS is 
the most obvious example where the atomic write implementation seems to 
always make disabling full_page_writes safe.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Jesper Krogh

On 2010-10-22 17:37, Greg Smith wrote:

 I think that most people who have thought they were safe to turn off
 full_page_writes in the past did so because they believed they were
 in category (1) here.  I've never advised anyone to do that, because
 it's so difficult to validate the truth of.  Just given that, I'd be
 tempted to join in on suggesting this parameter just go away in the
 name of safety, except that I think category (2) here is growing now.
 ZFS is the most obvious example where the atomic write implementation
 seems to always make disabling full_page_writes safe.


Can you point to some ZFS docs that tell that this is the case? I'd be
surprised if it doesn't copy away the old block and replace it with the
new one in-place. The other behaviour would quite quickly lead to a hugely
fragmented filesystem that performs next to useless, and ZFS doesn't seem
to be in that category.

  ... All given my total lack of insight into ZFS.

--
Jesper





Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Rob Wultsch
On Fri, Oct 22, 2010 at 8:37 AM, Greg Smith g...@2ndquadrant.com wrote:
 Tom Lane wrote:

 You've got entirely too simplistic a view of what the delta might be,
 I fear.  In particular there are various sorts of changes that involve
 inserting the data carried in the WAL record and shifting pre-existing
 data around to make room, or removing an item and moving remaining data
 around.  If you try to replay that type of action against a torn page,
 you'll get corrupted results.


 I wasn't sure exactly how those were encoded, thanks for the clarification.
  Given that, it seems to me there are only two situations where
 full_page_writes is safe to turn off:

 1) The operating system block size is exactly the same as the database block
 size, and all writes are guaranteed to be atomic to that block size.
 2) You're using a form of journaled filesystem where data blocks are never
 updated, they're always written elsewhere and the filesystem is redirected
 to that new block once it's on disk.

 Looks to me like whether or not there's a non-volatile write cache sitting
 in the middle, like a BBU protected RAID card, doesn't really make any
 difference here then.

 I think that most people who have thought they were safe to turn off
 full_page_writes in the past did so because they believed they were in
 category (1) here.  I've never advised anyone to do that, because it's so
 difficult to validate the truth of.  Just given that, I'd be tempted to join
 in on suggesting this parameter just go away in the name of safety, except
 that I think category (2) here is growing now.  ZFS is the most obvious
 example where the atomic write implementation seems to always make disabling
 full_page_writes safe.


For the sake of argument, has PG considered using a double write
buffer similar to InnoDB?


-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Kevin Grittner
Rob Wultsch wult...@gmail.com wrote:
 
 has PG considered using a double write buffer similar to InnoDB?
 
That seems inferior to the full_page_writes strategy, where you only
write a page twice the first time it is written after a checkpoint. 
We're talking about when we might be able to write *less*, not more.
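The full_page_writes strategy Kevin describes — a full image only on the first write to a page after each checkpoint, deltas thereafter — can be sketched roughly like this. Names are illustrative, not PostgreSQL's actual structures.

```python
# Rough sketch of full_page_writes bookkeeping: only the first write to
# a page after each checkpoint carries the full image; later writes to
# the same page log only the delta. Hypothetical structures throughout.

class Wal:
    def __init__(self):
        self.records = []
        self.dirty_since_checkpoint = set()

    def checkpoint(self):
        # After a checkpoint every page is "clean" again, so its next
        # write must carry a full image to guard against torn pages.
        self.dirty_since_checkpoint.clear()

    def log_write(self, page_no, full_image, delta):
        if page_no not in self.dirty_since_checkpoint:
            self.dirty_since_checkpoint.add(page_no)
            self.records.append(("full_page", page_no, full_image))
        else:
            self.records.append(("delta", page_no, delta))

wal = Wal()
wal.log_write(7, b"X" * 8192, b"d1")   # full image: first touch
wal.log_write(7, b"X" * 8192, b"d2")   # delta only
wal.checkpoint()
wal.log_write(7, b"X" * 8192, b"d3")   # full image again
print([r[0] for r in wal.records])  # ['full_page', 'delta', 'full_page']
```

This is why WAL volume depends so heavily on checkpoint frequency: each checkpoint resets the set, forcing another round of full images.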
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Rob Wultsch
On Fri, Oct 22, 2010 at 10:28 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Rob Wultsch wult...@gmail.com wrote:

 has PG considered using a double write buffer similar to InnoDB?

 That seems inferior to the full_page_writes strategy, where you only
 write a page twice the first time it is written after a checkpoint.
 We're talking about when we might be able to write *less*, not more.

 -Kevin


By write do you mean number of writes, or the number of bytes of the
writes? For number of writes, yes a double write buffer will lose. In
terms of number of bytes, I would think full_page_writes=off + double
write buffer should be far superior, particularly given that the WAL
is shipped over the network to slaves.

-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Kevin Grittner
Rob Wultsch wult...@gmail.com wrote:
 
 I would think full_page_writes=off + double write buffer should be
 far superior, particularly given that the WAL is shipped over the
 network to slaves.
 
For a reasonably brief description of InnoDB double write buffers, I
found this:
 
http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
 
One big question before even considering this would be how to
determine whether a potentially torn page is inconsistent. 
Without a page CRC or some such mechanism, I don't see how this
technique is possible.
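As a hedged illustration of the page-CRC idea (the layout and function names below are invented, not PostgreSQL's): store a checksum of the page contents in the page header, and a torn write — half new, half old — then fails verification after a crash.

```python
# Sketch of torn-page detection via a per-page checksum. The 4-byte
# CRC-at-offset-0 layout is hypothetical, chosen only for illustration.
import zlib

PAGE_SIZE = 8192
CRC_BYTES = 4

def seal_page(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload, forming an on-disk page."""
    assert len(payload) == PAGE_SIZE - CRC_BYTES
    return zlib.crc32(payload).to_bytes(CRC_BYTES, "big") + payload

def page_is_torn(page: bytes) -> bool:
    stored = int.from_bytes(page[:CRC_BYTES], "big")
    return stored != zlib.crc32(page[CRC_BYTES:])

old = seal_page(b"o" * (PAGE_SIZE - CRC_BYTES))
new = seal_page(b"n" * (PAGE_SIZE - CRC_BYTES))
torn = new[:4096] + old[4096:]   # crash mid-write: half new, half old
print(page_is_torn(new), page_is_torn(torn))
```

A consistent page verifies; the half-and-half page does not, which is exactly the signal a double-write scheme would need before deciding to restore the buffered copy.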
 
Even if it's possible, it's far from clear to me that it would be an
improvement.  The author estimates (apparently somewhat loosely)
that it's a 5% to 10% performance hit in InnoDB; I'm far from
certain that full_page_writes cost us that much.  Does anyone have
benchmark numbers handy?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Rob Wultsch
On Fri, Oct 22, 2010 at 12:05 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Rob Wultsch wult...@gmail.com wrote:

 I would think full_page_writes=off + double write buffer should be
 far superior, particularly given that the WAL is shipped over the
 network to slaves.

 For a reasonably brief description of InnoDB double write buffers, I
 found this:

 http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/

 One big question before even considering this would be how to
 determine whether a potentially torn page is inconsistent.
 Without a page CRC or some such mechanism, I don't see how this
 technique is possible.

 Even if it's possible, it's far from clear to me that it would be an
 improvement.  The author estimates (apparently somewhat loosely)
 that it's a 5% to 10% performance hit in InnoDB; I'm far from
 certain that full_page_writes cost us that much.  Does anyone have
 benchmark numbers handy?

 -Kevin


Ignoring (briefly) the performance cost of the different systems, not
needing full_page_writes would make geographically dispersed replication
possible for certain cases where it is not currently (or is at least
rather painful).

-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Kevin Grittner
Rob Wultsch wult...@gmail.com wrote:
 
 not needing full_page_writes would make geographically dispersed
 replication possible for certain cases where it is not currently
 (or at least rather painful).
 
Do you have any hard numbers on WAL file size impact?  How much does
pglesslog help in a file-based WAL transmission environment?  Should
we be considering similar filtering for streaming replication?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-22 Thread Rob Wultsch
On Fri, Oct 22, 2010 at 1:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Rob Wultsch wult...@gmail.com wrote:

 not needing full_page_writes would make geographically dispersed
 replication possible for certain cases where it is not currently
 (or at least rather painful).

 Do you have any hard numbers on WAL file size impact?  How much does
 pglesslog help in a file-based WAL transmission environment?  Should
 we be considering similar filtering for streaming replication?

 -Kevin


No, I am a DBA that mostly works on MySQL. I have had to deal with
(handwaving...) tangential issues recently. I really would like to
work with PG more and this seems like it would be a significant
hindrance for certain usage patterns. Lots of replication does not
take place over gig...


-- 
Rob Wultsch
wult...@gmail.com



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Steve Crawford

On 10/20/2010 09:45 PM, Scott Marlowe wrote:

On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drakej...@commandprompt.com  wrote:
   

On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
 

Ben Chobot wrote:
   

On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:

 

I'm weighing options for a new server. In addition to PostgreSQL, this machine 
will handle some modest Samba and Rsync load.

I will have enough RAM so the virtually all disk-read activity will be cached. 
The average PostgreSQL read activity will be modest - a mix of single-record 
and fairly large (reporting) result-sets. Writes will be modest as well but 
will come in brief (1-5 second) bursts of individual inserts. The rate of 
insert requests will hit 100-200/second for those brief bursts.

So...

Am I likely to be better off putting $$$ toward battery-backup on the RAID or 
toward adding a second RAID-set and splitting off the WAL traffic? Or something 
else?
   

A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
Dedicated WAL spindles are nice and all, but they're still spinning
media. Raid card cache is way faster, and while it's best at bursty
writes, it sounds like bursty writes are precisely what you have.
 

Totally agree!
   

BBU first, more spindles second.
 

Agreed.  Note that while you can get incredible burst performance from
a battery-backed cache, due to both caching and writing out of order,
once the throughput begins to saturate at the speed of the disk array,
the BBU cache is really only re-ordering, as it will eventually
fill up faster than the disks can take the writes, and you'll settle
in at some percentage of the max tps you get from a short benchmark
run.  It's vitally important that once you put a BBU cache in place,
you run a very long-running transactional test (pgbench is a simple
one to start with) that floods the I/O subsystem so you see what your
average throughput is with the WAL and data store getting flooded.  I
know on my system pgbench runs of a few minutes can be 3 or 4 times
faster than runs that last for the better part of an hour.

   
Thanks for all the replies. This is what I suspected but since I can't 
just buy one of everything to try, I wanted a sanity-check before 
spending the $$$.


I am not too worried about saturating the controller cache as the 
current much lower spec machine can handle the sustained load just fine 
and the bursts are typically only 1-3 seconds long spaced a minute or 
more apart.


Cheers,
Steve






Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Scott Marlowe wrote:
 On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com 
 wrote:
  On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
  Ben Chobot wrote:
   On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
  
I'm weighing options for a new server. In addition to PostgreSQL, this 
machine will handle some modest Samba and Rsync load.
   
 I will have enough RAM so that virtually all disk-read activity will be 
cached. The average PostgreSQL read activity will be modest - a mix of 
single-record and fairly large (reporting) result-sets. Writes will be 
modest as well but will come in brief (1-5 second) bursts of 
individual inserts. The rate of insert requests will hit 
100-200/second for those brief bursts.
   
So...
   
Am I likely to be better off putting $$$ toward battery-backup on the 
RAID or toward adding a second RAID-set and splitting off the WAL 
traffic? Or something else?
  
   A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
   Dedicated WAL spindles are nice and all, but they're still spinning
   media. Raid card cache is way faster, and while it's best at bursty
   writes, it sounds like bursty writes are precisely what you have.
 
  Totally agree!
 
  BBU first, more spindles second.
 
 Agreed.  Note that while you can get incredible burst performance from
 a battery-backed cache, due to both caching and writing out of order,
 once the throughput begins to saturate at the speed of the disk array,
 the BBU cache is really only re-ordering, as it will eventually
 fill up faster than the disks can take the writes, and you'll settle
 in at some percentage of the max tps you get from a short benchmark
 run.  It's vitally important that once you put a BBU cache in place,
 you run a very long-running transactional test (pgbench is a simple
 one to start with) that floods the I/O subsystem so you see what your
 average throughput is with the WAL and data store getting flooded.  I
 know on my system pgbench runs of a few minutes can be 3 or 4 times
 faster than runs that last for the better part of an hour.

With a BBU you can turn off full_page_writes, which should decrease the
WAL traffic.

However, I don't see this mentioned in our documentation.  Should I add
it?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 With a BBU you can turn off full_page_writes
 
My understanding is that that is not without risk.  What happens if
the WAL is written, there is a commit, but the data page has not yet
been written to the controller?  Don't we still have a torn page?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  With a BBU you can turn off full_page_writes
  
 My understanding is that that is not without risk.  What happens if
 the WAL is written, there is a commit, but the data page has not yet
 been written to the controller?  Don't we still have a torn page?

I don't see how full_page_writes affects pages that haven't been
written to the controller.

full_page_writes is designed to guard against a partial write to a
device.  I don't think the raid cache can be partially written to, and
the cache will not be cleared until the drive has fully written the data
to disk.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Bruce Momjian wrote:

With a BBU you can turn off full_page_writes, which should decrease the
WAL traffic.

However, I don't see this mentioned in our documentation.  Should I add
it?
  


What I would like to do is beef up the documentation with some concrete 
examples of how to figure out if your cache and associated write path 
are working reliably or not.  It should be possible to include "does 
this handle full page writes correctly?" in that test suite.  Until we 
have something like that, I'm concerned that bugs in filesystem or 
controller handling may make full_page_writes unsafe even with a BBU, 
and we'd have no way for people to tell if that's true or not.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 full_page_writes is designed to guard against a partial write to a
 device.  I don't think the raid cache can be partially written to
 
So you're confident that an 8kB write to the controller will not be
done as a series of smaller atomic writes by the OS file system?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

Bruce Momjian br...@momjian.us wrote:
 
  

full_page_writes is designed to guard against a partial write to a
device.  I don't think the raid cache can be partially written to

 
So you're confident that an 8kB write to the controller will not be

done as a series of smaller atomic writes by the OS file system?


Sure, that happens.  But if the BBU has gotten an fsync call after the 
8K write, it shouldn't return success until after all 8K are in its 
cache.  That's why full_page_writes should be safe on a system with a BBU 
as Bruce is suggesting.  But I'd like to see some independent proof of 
that fact, that includes some targeted tests users can run, before we 
start recommending that practice.


--
Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us





Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote: 
 Kevin Grittner wrote:
 
 So you're confident that an 8kB write to the controller will not
 be done as a series of smaller atomic writes by the OS file
 system?
 
 Sure, that happens.  But if the BBU has gotten an fsync call after
 the 8K write, it shouldn't return success until after all 8K are
 in its cache.
 
I'm not concerned about an fsync after the controller has it; I'm
concerned about a system crash in the middle of writing an 8K page
to the controller.  Other than the expected *size* of the window of
time during which you're vulnerable, what does a BBU caching
controller buy you in this regard?  Can't the OS rearrange the
writes of disk sectors after the 8K page is written to the OS cache
so that the window might occasionally be rather large?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Greg Smith g...@2ndquadrant.com wrote: 
  Kevin Grittner wrote:
  
  So you're confident that an 8kB write to the controller will not
  be done as a series of smaller atomic writes by the OS file
  system?
  
  Sure, that happens.  But if the BBU has gotten an fsync call after
  the 8K write, it shouldn't return success until after all 8K are
  in its cache.
  
 I'm not concerned about an fsync after the controller has it; I'm
 concerned about a system crash in the middle of writing an 8K page
 to the controller.  Other than the expected *size* of the window of
 time during which you're vulnerable, what does a BBU caching
 controller buy you in this regard?  Can't the OS rearrange the
 writes of disk sectors after the 8K page is written to the OS cache
 so that the window might occasionally be rather large?

If the write to the controller fails, the page is not flushed and PG
does not continue: the fsync never completes, and hence PG stops.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 I assume we send a full 8k to the controller, and a failure during
 that write is not registered as a write.
 
On what do you base that assumption?  I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other?
 
-Kevin



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  I assume we send a full 8k to the controller, and a failure during
  that write is not registered as a write.
  
 On what do you base that assumption?  I assume that we send a full
 8K to the OS cache, and the file system writes disk sectors
 according to its own algorithm.  With either platters or BBU cache,
 the data is persisted on fsync; why do you see a risk with one but
 not the other?

Now that is an interesting question.  We write 8k to the kernel, but the
kernel doesn't have to honor those write sizes, so while we probably
can't get a partial 512-byte block written to disk with a BBU (one that
isn't cleaned up by the BBU on reboot), we could get some 512-byte
blocks of an 8k write written and others not.

I agree you are right and a BBU does not mean you can safely turn off
full_page_writes.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Andres Freund
On Thursday 21 October 2010 21:42:06 Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  I assume we send a full 8k to the controller, and a failure during
  that write is not registered as a write.
 
 On what do you base that assumption?  I assume that we send a full
 8K to the OS cache, and the file system writes disk sectors
 according to its own algorithm.  With either platters or BBU cache,
 the data is persisted on fsync; why do you see a risk with one but
 not the other?
At least on Linux, pages can certainly get written out in <8kB batches if
you're under memory pressure.

Andres





Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

I assume that we send a full
8K to the OS cache, and the file system writes disk sectors
according to its own algorithm.  With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other


I'd like a 10 minute argument please.  I started to write something to 
refute this, only to clarify in my head the sequence of events that 
leads to the most questionable result, where I feel a bit less certain 
than I did before of the safety here.  Here is the worst case I believe 
you're describing:


1) Transaction is written to the WAL and sync'd; client receives 
COMMIT.  Since full_page_writes is off, the data in the WAL consists 
only of the delta of what changed on the page.

2) 8K database page is written to OS cache
3) PG calls fsync to force the database block out
4) OS writes first 4K block of the change to the BBU write cache.  Worst 
case, this fills the cache, and it takes a moment for some random writes 
to process before it has space to buffer again (makes this more likely 
to happen, but it's not required to see the failure case here)

5) Sudden power interruption, second half of the page write is lost
6) Server restarts
7) That 4K write is now replayed from the battery's cache

At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
data.  Without a full page write in the WAL, is it always possible to 
restore its original state now?  In theory, I think you can.  Since the 
delta in the WAL should be overwriting all of the bytes that changed 
between the old and new version of the page, applying it on top of any 
of the four possible states here:


1) None of the data was written to the database page yet
2) The first 4K of data was written out
3) The second 4K of data was written out
4) All 8K was actually written out

Should lead to the same result: an 8K page that includes the change that 
was in the WAL but not onto disk at the point when the crash happened.
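The four states above can be checked mechanically. Here is a toy Python model of that argument (an illustrative sketch only: it assumes the WAL delta is a plain byte overwrite at known offsets, which is not PostgreSQL's actual record format):

```python
# Toy model: an 8K page torn at the 4K boundary, with a WAL "delta"
# that simply overwrites every changed byte (an assumption, see above).
PAGE = 8192
old = bytearray(b"O" * PAGE)
new = bytearray(old)
new[4090:4098] = b"NEWBYTES"            # a change that straddles the 4K halves
delta = [(4090, b"NEWBYTES")]           # (offset, bytes) pairs covering all changes

def torn(state):
    """On-disk image for each of the four crash states enumerated above."""
    page = bytearray(old)
    if state in (2, 4):                 # first 4K half reached the disk
        page[:4096] = new[:4096]
    if state in (3, 4):                 # second 4K half reached the disk
        page[4096:] = new[4096:]
    return page

def replay(page, delta):
    """Apply the WAL delta on top of whatever image survived the crash."""
    for off, data in delta:
        page[off:off + len(data)] = data
    return page

# Every torn state converges to the correct page image after replay.
assert all(replay(torn(s), delta) == new for s in (1, 2, 3, 4))
```

Because the overwrite covers every changed byte, replay is idempotent and the starting state doesn't matter; Tom Lane's follow-up in this thread explains why real WAL records are not this simple.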


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Tom Lane
Greg Smith g...@2ndquadrant.com writes:
 At this point, you now have a torn 8K page, with 1/2 old and 1/2 new 
 data.

Right.

 Without a full page write in the WAL, is it always possible to 
 restore its original state now?  In theory, I think you can.  Since the 
 delta in the WAL should be overwriting all of the bytes that changed 
 between the old and new version of the page, applying it on top of any 
 of the four possible states here:

You've got entirely too simplistic a view of what the delta might be,
I fear.  In particular there are various sorts of changes that involve
inserting the data carried in the WAL record and shifting pre-existing
data around to make room, or removing an item and moving remaining data
around.  If you try to replay that type of action against a torn page,
you'll get corrupted results.

We could possibly redefine the WAL records so that they weren't just the
minimum amount of data but carried every byte that'd changed on the
page, and then I think what you're envisioning would work.  But the
records would be a lot bulkier.  It's not clear to me that this would be
a net savings over the current design, particularly not if there's
a long interval between checkpoints.
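The failure Tom describes is easy to reproduce in a toy model (purely illustrative; the byte layout and record format here are invented, not PostgreSQL's): replaying a shift-type record against a torn page duplicates the inserted data and loses what it displaces.

```python
def apply_insert(page, off, data):
    """Replay a shift-type WAL record: insert bytes at off, pushing the
    existing tail right into the page's free space."""
    return page[:off] + data + page[off:len(page) - len(data)]

old = b"ABC_"                      # "_" marks free space at the end of the page
new = apply_insert(old, 1, b"X")   # the original insert shifts B and C right
assert new == b"AXBC"

# Crash tears the page: the first half holds the new image, the second the old.
torn = new[:2] + old[2:]           # b"AXC_"

# Replaying the same record against the torn page shifts data that was
# already shifted once: "B" is gone and "X" is duplicated.
replayed = apply_insert(torn, 1, b"X")
assert replayed == b"AXXC"
assert replayed != new
```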

regards, tom lane



Re: [PERFORM] BBU Cache vs. spindles

2010-10-21 Thread Greg Smith

Kevin Grittner wrote:

With either platters or BBU cache,
the data is persisted on fsync; why do you see a risk with one but
not the other


Forgot to address this part.  The troublesome sequence if you don't have 
a BBU is:


1) WAL data is written to the OS cache
2) PG calls fsync
3) Data is transferred into the drive's volatile, non-battery-backed cache
4) Drive lies about data being on disk, says fsync is done
5) That 8K data page is written out to the OS cache, also with fsync, 
then onto the drive.  It says it has that too.
6) Due to current disk head location, 4KB of the data page gets written 
out before it gets to the WAL data

7) System crashes

Now you're dead.  You've just torn a data page, but not written any of 
the data to the WAL necessary to reconstruct any valid version of that page.


I think Kevin's point here may be that if your fsync isn't reliable, 
you're always in trouble.  But if your fsync is good, even torn pages 
should be repairable by the deltas written to the WAL, as I described in 
the message I just sent before this one.  That's true regardless of 
whether you achieved a non-lying fsync with a BBU or by turning a 
drive's write cache off.  There's nothing really special about the BBU 
beyond it being the most common form of reliable write cache that 
works.  You get the same properties at a slower rate with a drive that's 
configured to never lie about writes.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books




Re: [PERFORM] BBU Cache vs. spindles

2010-10-20 Thread Bruce Momjian
Ben Chobot wrote:
 On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
 
  I'm weighing options for a new server. In addition to PostgreSQL, this 
  machine will handle some modest Samba and Rsync load.
 
  I will have enough RAM so that virtually all disk-read activity will be 
  cached. The average PostgreSQL read activity will be modest - a mix of 
  single-record and fairly large (reporting) result-sets. Writes will be 
  modest as well but will come in brief (1-5 second) bursts of individual 
  inserts. The rate of insert requests will hit 100-200/second for those 
  brief bursts.
 
  So...
 
  Am I likely to be better off putting $$$ toward battery-backup on the RAID 
  or toward adding a second RAID-set and splitting off the WAL traffic? Or 
  something else?
 
 A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
 Dedicated WAL spindles are nice and all, but they're still spinning
 media. Raid card cache is way faster, and while it's best at bursty
 writes, it sounds like bursty writes are precisely what you have.

Totally agree!

--
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PERFORM] BBU Cache vs. spindles

2010-10-20 Thread Scott Marlowe
On Wed, Oct 20, 2010 at 8:25 PM, Joshua D. Drake j...@commandprompt.com wrote:
 On Wed, 2010-10-20 at 22:13 -0400, Bruce Momjian wrote:
 Ben Chobot wrote:
  On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:
 
   I'm weighing options for a new server. In addition to PostgreSQL, this 
   machine will handle some modest Samba and Rsync load.
  
   I will have enough RAM so that virtually all disk-read activity will be 
   cached. The average PostgreSQL read activity will be modest - a mix of 
   single-record and fairly large (reporting) result-sets. Writes will be 
   modest as well but will come in brief (1-5 second) bursts of individual 
   inserts. The rate of insert requests will hit 100-200/second for those 
   brief bursts.
  
   So...
  
   Am I likely to be better off putting $$$ toward battery-backup on the 
   RAID or toward adding a second RAID-set and splitting off the WAL 
   traffic? Or something else?
 
  A BBU is, what, $100 or so? Adding one seems a no-brainer to me.
  Dedicated WAL spindles are nice and all, but they're still spinning
  media. Raid card cache is way faster, and while it's best at bursty
  writes, it sounds like bursty writes are precisely what you have.

 Totally agree!

 BBU first, more spindles second.

Agreed.  Note that while you can get incredible burst performance from
a battery-backed cache, due to both caching and writing out of order,
once the throughput begins to saturate at the speed of the disk array,
the BBU cache is really only re-ordering, as it will eventually
fill up faster than the disks can take the writes, and you'll settle
in at some percentage of the max tps you get from a short benchmark
run.  It's vitally important that once you put a BBU cache in place,
you run a very long-running transactional test (pgbench is a simple
one to start with) that floods the I/O subsystem so you see what your
average throughput is with the WAL and data store getting flooded.  I
know on my system pgbench runs of a few minutes can be 3 or 4 times
faster than runs that last for the better part of an hour.
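A sketch of how such a comparison might be scripted (assumptions: pgbench is installed, and a test database, here called `bench` and initialized with `pgbench -i`, already exists; the client, thread, and duration values are illustrative, not recommendations):

```python
# Build pgbench command lines for a short burst run vs. a sustained run.
# -c = concurrent clients, -j = worker threads, -T = run length in seconds.
def pgbench_cmd(clients, jobs, seconds, db="bench"):
    return ["pgbench", "-c", str(clients), "-j", str(jobs),
            "-T", str(seconds), db]

short_run = pgbench_cmd(8, 4, 300)    # a few minutes: mostly exercises the cache
long_run = pgbench_cmd(8, 4, 3600)    # ~an hour: floods the WAL and data store

# Run each with e.g. subprocess.run(long_run) and compare the reported tps.
```

Comparing the tps figures from the two runs shows how much of the short-run number is burst absorption by the write cache rather than sustainable throughput.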



Re: [PERFORM] BBU Cache vs. spindles

2010-10-08 Thread Ben Chobot
On Oct 7, 2010, at 4:38 PM, Steve Crawford wrote:

 I'm weighing options for a new server. In addition to PostgreSQL, this 
 machine will handle some modest Samba and Rsync load.
 
 I will have enough RAM so that virtually all disk-read activity will be 
 cached. The average PostgreSQL read activity will be modest - a mix of 
 single-record and fairly large (reporting) result-sets. Writes will be modest 
 as well but will come in brief (1-5 second) bursts of individual inserts. The 
 rate of insert requests will hit 100-200/second for those brief bursts.
 
 So...
 
 Am I likely to be better off putting $$$ toward battery-backup on the RAID or 
 toward adding a second RAID-set and splitting off the WAL traffic? Or 
 something else?

A BBU is, what, $100 or so? Adding one seems a no-brainer to me. Dedicated WAL 
spindles are nice and all, but they're still spinning media. Raid card cache is 
way faster, and while it's best at bursty writes, it sounds like bursty 
writes are precisely what you have.





[PERFORM] BBU Cache vs. spindles

2010-10-07 Thread Steve Crawford
I'm weighing options for a new server. In addition to PostgreSQL, this 
machine will handle some modest Samba and Rsync load.


I will have enough RAM so that virtually all disk-read activity will be 
cached. The average PostgreSQL read activity will be modest - a mix of 
single-record and fairly large (reporting) result-sets. Writes will be 
modest as well but will come in brief (1-5 second) bursts of individual 
inserts. The rate of insert requests will hit 100-200/second for those 
brief bursts.


So...

Am I likely to be better off putting $$$ toward battery-backup on the 
RAID or toward adding a second RAID-set and splitting off the WAL 
traffic? Or something else?


Cheers,
Steve

