Re: [PERFORM] SSD + RAID

2010-03-03 Thread Ron Mayer
Greg Smith wrote:
 Bruce Momjian wrote:
 I always assumed SCSI disks had a write-through cache and therefore
 didn't need a drive cache flush comment.

Some do.  Some SCSI disks have write-back caches.

Some have both(!) - a write-back cache but the user can explicitly
send write-through requests.

Microsoft explains it well (IMHO) here:
http://msdn.microsoft.com/en-us/library/aa508863.aspx
  For example, suppose that the target is a SCSI device with
   a write-back cache. If the device supports write-through
   requests, the initiator can bypass the write cache by
   setting the force unit access (FUA) bit in the command
   descriptor block (CDB) of the write command.

 this perception, which I've recently come to believe isn't actually
 correct anymore.  ... I'm starting to think this is what
 we've all been observing rather than a write-through cache

I think what we've been observing is that guys with SCSI drives
are more likely to either
 (a) have battery-backed RAID controllers that ensure writes succeed,
or
 (b) have other decent RAID controllers that understand details
 like that FUA bit to send write-through requests even if
  a SCSI device has a write-back cache.

In contrast, most guys with PATA drives are probably running
software RAID (if any) with a RAID stack (older LVM and MD)
known to lose the cache flushing commands.
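
As an aside, a minimal sketch of how this looks from user space (my
illustration, not something from Ron's message; the file name is made up):
the application never sets FUA itself, it just opens with O_DSYNC or calls
fsync(), and the kernel/driver decides whether that becomes a FUA write, an
ATA FLUSH CACHE, or a SCSI SYNCHRONIZE CACHE.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical file name; O_DSYNC asks for write-through semantics */
    int fd = open("/tmp/durability-test.dat", O_WRONLY | O_CREAT | O_DSYNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "should be on stable storage before write() returns";
    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* With O_DSYNC, the data is supposed to reach stable storage before
     * write() returns -- but only if every layer underneath actually
     * honors the flush/FUA request, which is the whole point of this
     * thread. */
    close(fd);
    return 0;
}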




Re: [PERFORM] SSD + RAID

2010-03-02 Thread Pierre C



I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.


Maximum performance can only be reached with a writeback cache so the  
drive can reorder and cluster writes, according to the realtime position  
of the heads and platter rotation.


The problem is not the write cache itself, it is that, for your data to be
safe, the "flush cache" or "barrier" command must get all the way through
the application / filesystem to the hardware, going through an unspecified
number of software/firmware/hardware layers, all of which may:


- not specify if they honor or ignore flush/barrier commands, and which  
ones

- not specify if they will reorder writes ignoring barriers/flushes or not
- have been written by people who are not aware of such issues
- have been written by companies who are perfectly aware of such issues  
but chose to ignore them to look good in benchmarks

- have some incompatibilities that result in broken behaviour
- have bugs

As far as I'm concerned, a configuration that doesn't properly respect the  
commands needed for data integrity is broken.


The sad truth is that given a software/hardware IO stack, there's no way
to be sure, and testing isn't easy, if it is possible at all. Some cache
flushes might be ignored under some circumstances.
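
One crude test along these lines (my sketch, the same idea as tools like
diskchecker.pl or PostgreSQL's test_fsync, not something from this message):
time a loop of write+fsync on one small file. On a 7200 RPM disk an honest
flush can't complete much faster than a platter rotation, so thousands of
"fsyncs" per second strongly suggests some layer is absorbing the flush in a
volatile cache.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("fsync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    int i;
    for (i = 0; i < 2000; i++) {
        if (pwrite(fd, "x", 1, 0) != 1) { perror("pwrite"); return 1; }
        if (fsync(fd) != 0)             { perror("fsync");  return 1; }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* Much more than ~100-200 per second on a single rotating disk means
     * the flushes are probably not reaching the platters. */
    printf("%.0f fsyncs/sec\n", i / secs);
    return 0;
}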


For this to change, you don't need a hardware change, but a mentality  
change.


Flash filesystem developers use flash simulators which measure wear  
leveling, etc.


We'd need a virtual box with a simulated virtual hard drive which is able
to check this.


What a mess.




Re: [PERFORM] SSD + RAID

2010-03-01 Thread Bruce Momjian
Ron Mayer wrote:
 Bruce Momjian wrote:
  Greg Smith wrote:
  Bruce Momjian wrote:
  I have added documentation about the ATAPI drive flush command, and the

  If one of us goes back into that section one day to edit again it might 
  be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
  that a drive needs to support properly.  I wouldn't bother with another 
  doc edit commit just for that specific part though, pretty obscure.
  
  That setting name was not easy to find so I added it to the
  documentation.
 
 If we're spelling out specific IDE commands, it might be worth
 noting that the corresponding SCSI command is SYNCHRONIZE CACHE[1].
 
 
 Linux apparently sends FLUSH_CACHE commands to IDE drives in the
 exact same places it sends SYNCHRONIZE CACHE commands to SCSI
 drives[2].
 
 It seems that the same file systems, SW raid layers,
 virtualization platforms, and kernels that have a problem
 sending FLUSH CACHE commands to SATA drives have the exact
 same problems sending SYNCHRONIZE CACHE commands to SCSI drives.
 With the exact same effect of not getting writes all the way
 through disk caches.

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [PERFORM] SSD + RAID

2010-03-01 Thread Bruce Momjian
Greg Smith wrote:
 Ron Mayer wrote:
  Linux apparently sends FLUSH_CACHE commands to IDE drives in the
  exact same places it sends SYNCHRONIZE CACHE commands to SCSI
  drives[2].
[2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114

 
 Well, that's old enough to not even be completely right anymore about 
 SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6 
 to do the right thing on modern drives and that gets used nowadays, and 
 that doesn't necessarily do so on most of the SSDs out there; all of 
 which Bruce's recent doc additions now talk about correctly.
 
 There's this one specific area we know about that the most popular 
 systems tend to get really wrong all the time; that's got the 
 appropriate warning now with the right magic keywords that people can 
 look into it more if motivated.  While it would be nice to get super 
 thorough and document everything, I think there's already more docs in 
 there than this project would prefer to have to maintain in this area.
 
 Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the 
 idea is to be complete that's where this would go.  I don't know that 
 the documentation needs to address every possible way every possible 
 filesystem can be flushed. 

The bottom line is that the reason we have so much detailed
documentation about this is that mostly only database folks care about
such issues, so we end up having to research and document this
ourselves --- I don't see any alternatives.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do



Re: [PERFORM] SSD + RAID

2010-03-01 Thread Greg Smith

Bruce Momjian wrote:

I always assumed SCSI disks had a write-through cache and therefore
didn't need a drive cache flush comment.
  


There's more detail on all this mess at 
http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks and it includes 
this perception, which I've recently come to believe isn't actually 
correct anymore.  Like the IDE crowd, it looks like one day somebody
said "hey, we lose every write-heavy benchmark badly because we only
have a write-through cache", and that principle got lost along the
wayside.  What has been true, and I'm starting to think this is what
we've all been observing rather than a write-through cache, is that the
proper cache flushing commands have been there in working form for so 
much longer that it's more likely your SCSI driver and drive do the 
right thing if the filesystem asks them to.  SCSI SYNCHRONIZE CACHE has 
a much longer and prouder history than IDE's FLUSH_CACHE and SATA's 
FLUSH_CACHE_EXT.


It's also worth noting that many current SAS drives, the current SCSI 
incarnation, are basically SATA drives with a bridge chipset stuck onto 
them, or with just the interface board swapped out.  This is one reason why
top-end SAS capacities lag behind consumer SATA drives.  They use the 
consumers as beta testers to get the really fundamental firmware issues 
sorted out, and once things are stable they start stamping out the 
version with the SAS interface instead.  (Note that there's a parallel 
manufacturing approach that makes much smaller SAS drives, the 2.5"
server models or those at higher RPMs, that doesn't go through this 
path.  Those are also the really expensive models, due to economy of 
scale issues).  The idea that these would have fundamentally different 
write cache behavior doesn't really follow from that development model.


At this point, there are only two common differences between consumer 
and enterprise hard drives of the same size and RPM when there are 
directly matching ones:


1) You might get SAS instead of SATA as the interface, which provides 
the more mature command set I was talking about above--and therefore may 
give you a sane write-back cache with proper flushing, which is all the 
database really expects.


2) The timeouts when there's a read/write problem are tuned down in the 
enterprise version, to be more compatible with RAID setups where you 
want to push the drive off-line when this happens rather than presuming 
you can fix it.  Consumers would prefer that the drive spend a lot of
time doing heroics to try and save their sole copy of the apparently 
missing data.


You might get a slightly higher grade of parts if you're lucky too; I 
wouldn't count on it though.  That seems to be saved for the high RPM or 
smaller size drives only.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-27 Thread Greg Smith

Bruce Momjian wrote:

I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.
  


If one of us goes back into that section one day to edit again it might 
be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
that a drive needs to support properly.  I wouldn't bother with another 
doc edit commit just for that specific part though, pretty obscure.


I find it kind of funny how many discussions run in parallel about even 
really detailed technical implementation details around the world.  For 
example, doesn't 
http://www.mail-archive.com/zfs-disc...@opensolaris.org/msg30585.html 
look exactly like the exchange between myself and Arjen the other day, 
referencing the same AnandTech page?


Could be worse; one of us could be the poor sap at 
http://opensolaris.org/jive/thread.jspa;jsessionid=41B679C30D136C059E1BB7C06CA7DCE0?messageID=397730 
who installed Windows XP, VirtualBox for Windows, an OpenSolaris VM 
inside of it, and then was shocked that cache flushes didn't make their 
way all the way through that chain and had his 10TB ZFS pool corrupted 
as a result.  Hurray for virtualization!


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-27 Thread Ron Mayer
Bruce Momjian wrote:
 Greg Smith wrote:
 Bruce Momjian wrote:
 I have added documentation about the ATAPI drive flush command, and the
   
 If one of us goes back into that section one day to edit again it might 
 be worth mentioning that FLUSH CACHE EXT is the actual ATAPI-6 command 
 that a drive needs to support properly.  I wouldn't bother with another 
 doc edit commit just for that specific part though, pretty obscure.
 
 That setting name was not easy to find so I added it to the
 documentation.

If we're spelling out specific IDE commands, it might be worth
noting that the corresponding SCSI command is SYNCHRONIZE CACHE[1].


Linux apparently sends FLUSH_CACHE commands to IDE drives in the
exact same places it sends SYNCHRONIZE CACHE commands to SCSI
drives[2].

It seems that the same file systems, SW raid layers,
virtualization platforms, and kernels that have a problem
sending FLUSH CACHE commands to SATA drives have the exact
same problems sending SYNCHRONIZE CACHE commands to SCSI drives.
With the exact same effect of not getting writes all the way
through disk caches.

No?


[1] http://linux.die.net/man/8/sg_sync
[2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114



Re: [PERFORM] SSD + RAID

2010-02-27 Thread Greg Smith

Ron Mayer wrote:

Linux apparently sends FLUSH_CACHE commands to IDE drives in the
exact same places it sends SYNCHRONIZE CACHE commands to SCSI
drives[2].
  [2] http://hardware.slashdot.org/comments.pl?sid=149349&cid=12519114
  


Well, that's old enough to not even be completely right anymore about 
SATA disks and kernels.  It's FLUSH_CACHE_EXT that's been added to ATA-6 
to do the right thing on modern drives and that gets used nowadays, and 
that doesn't necessarily do so on most of the SSDs out there; all of 
which Bruce's recent doc additions now talk about correctly.


There's this one specific area we know about that the most popular 
systems tend to get really wrong all the time; that's got the 
appropriate warning now with the right magic keywords that people can 
look into it more if motivated.  While it would be nice to get super 
thorough and document everything, I think there's already more docs in 
there than this project would prefer to have to maintain in this area.


Are we going to get into IDE, SATA, SCSI, SAS, FC, and iSCSI?  If the 
idea is to be complete that's where this would go.  I don't know that 
the documentation needs to address every possible way every possible 
filesystem can be flushed. 


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-26 Thread Bruce Momjian

I have added documentation about the ATAPI drive flush command, and the
typical SSD behavior.

---

Greg Smith wrote:
 Ron Mayer wrote:
  Bruce Momjian wrote:

  Agreed, though I thought the problem was that SSDs lie about their
  cache flush like SATA drives do, or is there something I am missing?
  
 
  There's exactly one case I can find[1] where this century's IDE
  drives lied more than any other drive with a cache:
 
 Ron is correct that the problem of mainstream SATA drives accepting the 
 cache flush command but not actually doing anything with it is long gone 
 at this point.  If you have a regular SATA drive, it almost certainly 
 supports proper cache flushing.  And if your whole software/storage 
  stack understands all that, you should not end up with corrupted data
  just because there's a volatile write cache in there.
 
 But the point of this whole testing exercise coming back into vogue 
 again is that SSDs have returned this negligent behavior to the 
 mainstream again.  See 
 http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion 
 of this in a ZFS context just last month.  There are many documented 
 cases of Intel SSDs that will fake a cache flush, such that the only way 
  to get good reliable writes is to totally disable their write
 caches--at which point performance is so bad you might as well have 
 gotten a RAID10 setup instead (and longevity is toast too).
 
 This whole area remains a disaster area and extreme distrust of all the 
 SSD storage vendors is advisable at this point.  Basically, if I don't 
 see the capacitor responsible for flushing outstanding writes, and get a 
 clear description from the manufacturer how the cached writes are going 
 to be handled in the event of a power failure, at this point I have to 
  assume the answer is "badly" and your data will be eaten.  And the
 prices for SSDs that meet that requirement are still quite steep.  I 
 keep hoping somebody will address this market at something lower than 
 the standard enterprise prices.  The upcoming SandForce designs seem 
 to have thought this through correctly:  
  http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the
 product's not out to the general public yet (just like the Seagate units 
 that claim to have capacitor backups--I heard a rumor those are also 
 Sandforce designs actually, so they may be the only ones doing this 
 right and aiming at a lower price).
 
 -- 
 Greg Smith  2ndQuadrant US  Baltimore, MD
 PostgreSQL Training, Services and Support
 g...@2ndquadrant.com   www.2ndQuadrant.us
 

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.62
diff -c -c -r1.62 wal.sgml
*** doc/src/sgml/wal.sgml	20 Feb 2010 18:28:37 -	1.62
--- doc/src/sgml/wal.sgml	27 Feb 2010 01:37:03 -
***************
*** 59,66 ****
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure.  Many solid-state drives also have volatile write-back
!     caches.  To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use
--- 59,69 ----
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure, though <acronym>ATAPI-6</> introduced a drive cache
!     flush command that some file systems use, e.g. <acronym>ZFS</>.
!     Many solid-state drives also have volatile write-back
!     caches, and many do not honor cache flush commands by default.
!     To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use



Re: [PERFORM] SSD + RAID

2010-02-24 Thread Dave Crooke
It's always possible to rebuild into a consistent configuration by assigning
a precedence order; for parity RAID, the data drives take precedence over
parity drives, and for RAID-1 sets it assigns an arbitrary master.

You *should* never lose a whole stripe ... for example, RAID-5 updates do
read old data / parity, write new data, write new parity ... there is no
need to touch any other data disks, so they will be preserved through the
rebuild. Similarly, if only one block is being updated there is no need to
update the entire stripe.
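
As a worked illustration of that read-modify-write update (my sketch, not
from Dave's message): the new parity is just old parity XOR old data XOR new
data, so nothing else in the stripe needs to be read or written.

#include <stddef.h>

/* RAID-5 small-write parity update:
 *   new_parity = old_parity ^ old_data ^ new_data
 * The other data disks in the stripe are never touched. */
void raid5_update_parity(unsigned char *parity,
                         const unsigned char *old_data,
                         const unsigned char *new_data,
                         size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}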

David - what caused /dev/md to decide to take an array offline?

Cheers
Dave

On Tue, Feb 23, 2010 at 3:22 PM, da...@lang.hm wrote:

 On Tue, 23 Feb 2010, Aidan Van Dyk wrote:

  * da...@lang.hm da...@lang.hm [100223 15:05]:

  However, one thing that you do not get protection against with software
 raid is the potential for the writes to hit some drives but not others.
 If this happens the software raid cannot know what the correct contents
  of the raid stripe are, and so you could lose everything in that stripe
 (including contents of other files that are not being modified that
 happened to be in the wrong place on the array)


 That's for stripe-based raid.  Mirror sets like raid-1 should give you
 either the old data, or the new data, both acceptable responses since
  the fsync/barrier hasn't completed.

 Or have I missed another subtle interaction?


 one problem is that when the system comes back up and attempts to check the
 raid array, it is not going to know which drive has valid data. I don't know
 exactly what it does in that situation, but this type of error in other
 conditions causes the system to take the array offline.


 David Lang




Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Mon, 22 Feb 2010, Ron Mayer wrote:



Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.



I think I saw some stuff in the last few months on this issue on the 
kernel mailing list. You may want to double-check this when 2.6.33 gets
released (probably this week)


David Lang



Re: [PERFORM] SSD + RAID

2010-02-23 Thread Pierre C

Note that's power draw per bit.  dram is usually much more densely
packed (it can be with fewer transistors per cell) so the individual
chips for each may have similar power draws while the dram will be 10
times as densely packed as the sram.


Differences between SRAM and DRAM :

- price per byte (DRAM much cheaper)

- silicon area per byte (DRAM much smaller)

- random access latency
   SRAM = fast, uniform, and predictable, usually 0/1 cycles
   DRAM = a few up to a lot of cycles depending on chip type,
   which page/row/column you want to access, whether it's R or W,
   whether the page is already open, etc

In fact, DRAM is the new harddisk. SRAM is used mostly when low-latency is  
needed (caches, etc).


- ease of use :
   SRAM very easy to use : address, data, read, write, clock.
   SDRAM needs a smart controller.
   SRAM easier to instantiate on a silicon chip

- power draw
   When used at high speeds, SRAM isn't power-saving at all, it's used for
speed.

   However when not used, the power draw is really negligible.

While it is true that you can recover *some* data out of a SRAM/DRAM chip  
that hasn't been powered for a few seconds, you can't really trust that  
data. It's only a forensics tool.


Most DRAM now (especially laptop DRAM) includes special power-saving modes  
which only keep the data retention logic (refresh, etc) powered, but not  
the rest of the chip (internal caches, IO buffers, etc). Laptops, PDAs,  
etc all use this feature in suspend-to-RAM mode. In this mode, the power  
draw is higher than SRAM, but still pretty minimal, so a laptop can stay  
in suspend-to-RAM mode for days.


Anyway, the SRAM vs DRAM isn't really relevant for the debate of SSD data  
integrity. You can back up both with a small battery or ultra-cap.


What is important too is that the entire SSD chipset must have been  
designed with this in mind : it must detect power loss, and correctly  
react to it, and especially not reset itself or do funny stuff to the  
memory when the power comes back. Which means at least some parts of the  
chipset must stay powered to keep their state.


Now I wonder about something. SSDs use wear-leveling which means the  
information about which block was written where must be kept somewhere.  
Which means this information must be updated. I wonder how crash-safe and  
how atomic these updates are, in the face of a power loss.  This is just  
like a filesystem. You've been talking only about data, but the block  
layout information (metadata) is subject to the same concerns. If the  
drive says it's written, not only the data must have been written, but  
also the information needed to locate that data...


Therefore I think the yank-the-power-cord test should be done with random  
writes happening on an aged and mostly-full SSD... and afterwards, I'd be  
interested to know if not only the last txn really committed, but if some  
random parts of other stuff weren't wear-leveled into oblivion at the  
power loss...









Re: [PERFORM] SSD + RAID

2010-02-23 Thread Nikolas Everett
On Tue, Feb 23, 2010 at 6:49 AM, Pierre C li...@peufeu.com wrote:

  Note that's power draw per bit.  dram is usually much more densely
 packed (it can be with fewer transistors per cell) so the individual
 chips for each may have similar power draws while the dram will be 10
 times as densely packed as the sram.


 Differences between SRAM and DRAM :

 [lots of informative stuff]


I've been slowly reading the paper at
http://people.redhat.com/drepper/cpumemory.pdf  which has a big section on
SRAM vs DRAM with nice pretty pictures. While not strictly relevant it's been
illuminating and I wanted to share.


Re: [PERFORM] SSD + RAID

2010-02-23 Thread Scott Carey

On Feb 23, 2010, at 3:49 AM, Pierre C wrote:
 Now I wonder about something. SSDs use wear-leveling which means the  
 information about which block was written where must be kept somewhere.  
 Which means this information must be updated. I wonder how crash-safe and  
 how atomic these updates are, in the face of a power loss.  This is just  
 like a filesystem. You've been talking only about data, but the block  
 layout information (metadata) is subject to the same concerns. If the  
 drive says it's written, not only the data must have been written, but  
 also the information needed to locate that data...
 
 Therefore I think the yank-the-power-cord test should be done with random  
 writes happening on an aged and mostly-full SSD... and afterwards, I'd be  
 interested to know if not only the last txn really committed, but if some  
 random parts of other stuff weren't wear-leveled into oblivion at the  
 power loss...
 

A couple years ago I postulated that SSD's could do random writes fast if they 
remapped blocks.  Microsoft's SSD whitepaper at the time hinted at this too.
Persisting the remap data is not hard.  It goes in the same location as the 
data, or a separate area that can be written to linearly.

Each block may contain its LBA and a transaction ID or other atomic count.  Or 
another block can have that info.  When the SSD
powers up, it can build its table of LBA -> block mappings by looking at that data and
inverting it and keeping the highest transaction ID for duplicate LBA claims.

Although SSD's have to ERASE data in a large block at a time (256K to 2M 
typically), they can write linearly to an erased block in much smaller chunks.
Thus, to commit a write, either:
Data, LBA tag, and txID in same block (may require oddly sized blocks).
or
Data written to one block (not committed yet), then LBA tag and txID written 
elsewhere (which commits the write).  Since it's all copy on write, partial
writes can't happen.
If a block is being moved or compressed when power fails data should never be 
lost since the old data still exists, the new version just didn't commit.  But 
new data that is being written may not be committed yet in the case of a power 
failure unless other measures are taken.
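
A toy sketch of the power-up scan described above (structures and names are
mine, purely hypothetical, not any vendor's firmware): scan every physical
block's (LBA, txID) tag and keep, per LBA, the physical block with the
highest transaction ID.

#include <stdint.h>

#define NUM_PHYS_BLOCKS 1024      /* hypothetical toy geometry */
#define NUM_LBAS        1024

struct block_tag {
    uint32_t lba;     /* logical block this physical block claims to hold */
    uint64_t tx_id;   /* monotonically increasing commit counter */
    int      valid;   /* 0 for erased / never-written physical blocks */
};

/* Rebuild the LBA -> physical block map after power-up by inverting the
 * per-block tags; a duplicate LBA claim means an old copy plus a newer
 * remapped copy, and the higher txID wins. */
void rebuild_map(const struct block_tag tags[NUM_PHYS_BLOCKS],
                 int32_t lba_to_phys[NUM_LBAS])
{
    uint64_t best_tx[NUM_LBAS];

    for (int i = 0; i < NUM_LBAS; i++) {
        lba_to_phys[i] = -1;               /* -1 = LBA never written */
        best_tx[i] = 0;
    }

    for (int p = 0; p < NUM_PHYS_BLOCKS; p++) {
        if (!tags[p].valid || tags[p].lba >= NUM_LBAS)
            continue;
        uint32_t l = tags[p].lba;
        if (lba_to_phys[l] == -1 || tags[p].tx_id > best_tx[l]) {
            lba_to_phys[l] = p;
            best_tx[l] = tags[p].tx_id;
        }
    }
}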

 
 
 
 
 




Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Tue, 23 Feb 2010, da...@lang.hm wrote:


On Mon, 22 Feb 2010, Ron Mayer wrote:



Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.



I think I saw some stuff in the last few months on this issue on the kernel 
mailing list. You may want to double-check this when 2.6.33 gets released
(probably this week)


to clarify further (after getting more sleep ;-)

I believe that the linux software raid always did the right thing if you 
did a fsync/fdatasync. However, barriers that filesystems attempted to use
to avoid the need for a hard fsync used to be silently ignored. I believe 
these are now honored (in at least some configurations)


However, one thing that you do not get protection against with software 
raid is the potential for the writes to hit some drives but not others. If 
this happens the software raid cannot know what the correct contents of 
the raid stripe are, and so you could lose everything in that stripe
(including contents of other files that are not being modified that 
happened to be in the wrong place on the array)


If you have critical data, you _really_ want to use a raid controller with 
battery backup so that if you lose power you have a chance of eventually
completing the write.


David Lang



Re: [PERFORM] SSD + RAID

2010-02-23 Thread Aidan Van Dyk
* da...@lang.hm da...@lang.hm [100223 15:05]:

 However, one thing that you do not get protection against with software  
 raid is the potential for the writes to hit some drives but not others. 
 If this happens the software raid cannot know what the correct contents 
 of the raid stripe are, and so you could lose everything in that stripe
 (including contents of other files that are not being modified that  
 happened to be in the wrong place on the array)

That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't completed.

Or have I missed another subtle interaction?

a.

-- 
Aidan Van Dyk Create like a god,
ai...@highrise.ca   command like a king,
http://www.highrise.ca/   work like a slave.




Re: [PERFORM] SSD + RAID

2010-02-23 Thread david

On Tue, 23 Feb 2010, Aidan Van Dyk wrote:


* da...@lang.hm da...@lang.hm [100223 15:05]:


However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others.
If this happens the software raid cannot know what the correct contents
of the raid stripe are, and so you could lose everything in that stripe
(including contents of other files that are not being modified that
happened to be in the wrong place on the array)


That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't completed.

Or have I missed another subtle interaction?


one problem is that when the system comes back up and attempts to check 
the raid array, it is not going to know which drive has valid data. I 
don't know exactly what it does in that situation, but this type of error 
in other conditions causes the system to take the array offline.


David Lang



Re: [PERFORM] SSD + RAID

2010-02-23 Thread Mark Mielke

On 02/23/2010 04:22 PM, da...@lang.hm wrote:

On Tue, 23 Feb 2010, Aidan Van Dyk wrote:


* da...@lang.hm da...@lang.hm [100223 15:05]:


However, one thing that you do not get protection against with software
raid is the potential for the writes to hit some drives but not others.
If this happens the software raid cannot know what the correct contents
of the raid stripe are, and so you could lose everything in that
stripe

(including contents of other files that are not being modified that
happened to be in the wrong place on the array)


That's for stripe-based raid.  Mirror sets like raid-1 should give you
either the old data, or the new data, both acceptable responses since
the fsync/barrier hasn't completed.

Or have I missed another subtle interaction?


one problem is that when the system comes back up and attempts to 
check the raid array, it is not going to know which drive has valid 
data. I don't know exactly what it does in that situation, but this 
type of error in other conditions causes the system to take the array 
offline.


I think the real concern here is that depending on how the data is read 
later - and depending on which disks it reads from - it could read 
*either* old or new, at any time in the future. I.e. it reads new from 
disk 1 the first time, and then an hour later it reads old from disk 2.


I think this concern might be invalid for a properly running system, 
though. When a RAID array is not cleanly shut down, the RAID array 
should run in degraded mode until it can be sure that the data is 
consistent. In this case, it should pick one drive, and call it the 
live one, and then rebuild the other from the live one. Until it is 
re-built, it should only satisfy reads from the live one, or parts of 
the rebuilding one that are known to be clean.


I use mdadm software RAID, and all of my reading (including some of its
source code) and experience (shutting down the box uncleanly) tells me, 
it is working properly. In fact, the rebuild process can get quite 
ANNOYING as the whole system becomes much slower during rebuild, and 
rebuild of large partitions can take hours to complete.


For mdadm, there is a not-so-well-known write-intent bitmap 
capability. Once enabled, mdadm will embed a small bitmap (128 bits?) 
into the partition, and each bit will indicate a section of the 
partition. Before writing to a section, it will mark that section as 
dirty using this bitmap. It will leave this bit set for some time after 
the partition is clean (lazy clear). The effect of this, is that at 
any point in time, only certain sections of the drive are dirty, and on 
recovery, it is a lot cheaper to only rebuild the dirty sections. It 
works really well.
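
As a rough sketch of the ordering that makes this work (mine, not mdadm's
actual implementation; all names and sizes here are made up): the dirty bit
for a chunk has to be durable before the data write starts, and it is
cleared lazily afterwards, so a crash only forces a resync of chunks still
marked dirty.

#include <stddef.h>
#include <stdint.h>

#define CHUNK_SECTORS (128 * 1024)   /* hypothetical: 64 MB chunks of 512-byte sectors */
#define MAX_CHUNKS    (1024 * 64)    /* toy device: at most 65536 chunks */

static uint64_t bitmap[MAX_CHUNKS / 64];  /* in-core copy of the on-disk bitmap */

static void flush_bitmap(void)                 { /* write bitmap block, then cache flush */ }
static void write_both_legs(uint64_t sector,
                            const void *buf,
                            size_t len)        { (void)sector; (void)buf; (void)len; }

void mirrored_write(uint64_t sector, const void *buf, size_t len)
{
    uint64_t chunk = sector / CHUNK_SECTORS;
    if (chunk >= MAX_CHUNKS)
        return;

    bitmap[chunk / 64] |= 1ULL << (chunk % 64);   /* 1. mark chunk dirty */
    flush_bitmap();                               /*    ...and make that durable first */

    write_both_legs(sector, buf, len);            /* 2. write data to both mirror legs */

    /* 3. the dirty bit is cleared lazily, some time later, not here */
}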


So, I don't think this has to be a problem. There are solutions, and any 
solution that claims to be complete should offer these sorts of 
capabilities.


Cheers,
mark




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Bruce Momjian
Greg Smith wrote:
 Ron Mayer wrote:
  Bruce Momjian wrote:

  Agreed, though I thought the problem was that SSDs lie about their
  cache flush like SATA drives do, or is there something I am missing?
  
 
  There's exactly one case I can find[1] where this century's IDE
  drives lied more than any other drive with a cache:
 
 Ron is correct that the problem of mainstream SATA drives accepting the 
 cache flush command but not actually doing anything with it is long gone 
 at this point.  If you have a regular SATA drive, it almost certainly 
 supports proper cache flushing.  And if your whole software/storage 
  stack understands all that, you should not end up with corrupted data
  just because there's a volatile write cache in there.

OK, but I have a few questions.  Is a write to the drive and a cache
flush command the same?  Which file systems implement both?  I thought a
write to the drive was always assumed to flush it to the platters,
assuming the drive's cache is set to write-through.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Bruce Momjian
Ron Mayer wrote:
 Bruce Momjian wrote:
  Agreed, though I thought the problem was that SSDs lie about their
  cache flush like SATA drives do, or is there something I am missing?
 
 There's exactly one case I can find[1] where this century's IDE
 drives lied more than any other drive with a cache:
 
   Under 120GB Maxtor drives from late 2003 to early 2004.
 
 and it's apparently been worked around for years.
 
 Those drives claimed to support the FLUSH_CACHE_EXT feature (IDE
 command 0xEA), but did not support sending 48-bit commands which
 was needed to send the cache flushing command.
 
 And for that case a workaround for Linux was quickly identified by
 checking for *both* the support for 48-bit commands and support for the
 flush cache extension[2].
 
 
 Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
 of such reports have been various misfeatures in some of Linux's
 filesystems (like EXT3 that only wants to send drives cache-flushing
 commands when inodes change[3]) and Linux software raid misfeatures
 
 ...and ISTM those would affect SSDs the same way they'd affect SATA drives.

I think the point is not that drives lie about their write-back and
write-through behavior, but rather that many SATA/IDE drives default to
write-back, and not write-through, and many administrators and file
systems are not aware of this behavior.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Ron Mayer
Bruce Momjian wrote:
 Greg Smith wrote:
  If you have a regular SATA drive, it almost certainly 
 supports proper cache flushing
 
 OK, but I have a few questions.  Is a write to the drive and a cache
 flush command the same?

I believe they're different as of ATAPI-6 from 2001.

 Which file systems implement both?

Seems ZFS and recent ext4 have thought these interactions out
thoroughly.   Find a slow ext4 that people complain about, and
that's the one doing it right :-).

Ext3 has some particularly odd annoyances where it flushes and waits
for certain writes (ones involving inode changes) but doesn't bother
to flush others (just data changes).   As far as I can tell, with
ext3 you need userspace utilities to make sure flushes occur when
you need them.  At one point I was tempted to try to put such
userspace hacks into postgres.

I know less about other file systems.  Apparently the NTFS guys
are aware of such stuff - but don't know what kinds of fsync equivalent
you'd need to make it happen.

Also worth noting - Linux's software raid stuff (MD and LVM)
need to handle this right as well - and last I checked (sometime
last year) the default setups didn't.

  I thought a
 write to the drive was always assumed to flush it to the platters,
 assuming the drive's cache is set to write-through.

Apparently somewhere around here:
http://www.t10.org/t13/project/d1410r3a-ATA-ATAPI-6.pdf
they were separated in the IDE world.



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Mark Mielke

On 02/22/2010 08:04 PM, Greg Smith wrote:

Arjen van der Meijden wrote:

That's weird. Intel's SSD's didn't have a write cache afaik:
I asked Intel about this and it turns out that the DRAM on the Intel 
drive isn't used for user data because of the risk of data loss, 
instead it is used as memory by the Intel SATA/flash controller for 
deciding exactly where to write data (I'm assuming for the wear 
leveling/reliability algorithms).

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10


Read further down:

Despite the presence of the external DRAM, both the Intel controller 
and the JMicron rely on internal buffers to cache accesses to the 
SSD...Intel's controller has a 256KB SRAM on-die.


That's the problematic part:  the Intel controllers have a volatile 
256KB write cache stored deep inside the SSD controller, and issuing a 
standard SATA write cache flush command doesn't seem to clear it.  
Makes the drives troublesome for database use.


I had read the above when posted, and then looked up SRAM. SRAM seems to 
suggest it will hold the data even after power loss, but only for a 
period of time. As long as power can restore within a few minutes, it 
seemed like this would be ok?


I can understand a SSD might do unexpected things when it loses power 
all of a sudden. It will probably try to group writes to fill a 
single block (and those blocks vary in size but are normally way 
larger than those of a normal spinning disk, they are values like 256 
or 512KB) and it might lose the data waiting until a full block can be
written, or perhaps it just couldn't complete a full block-write
due to the power failure.
Although that behavior isn't really what you want, it would be 
incorrect to blame write caching for the behavior if the device 
doesn't even have a write cache ;)


If you write data and that write call returns before the data hits 
disk, it's a write cache, period.  And if that write cache loses its 
contents if power is lost, it's a volatile write cache that can cause 
database corruption.  The fact that the one on the Intel devices is 
very small, basically just dealing with the block chunking behavior 
you describe, doesn't change either of those facts.




The SRAM seems to suggest that it does not necessarily lose its contents 
if power is lost - it just doesn't say how long you have to plug it back 
in. Isn't this similar to a battery-backed cache or capacitor-backed cache?


I'd love to have a better guarantee - but is SRAM really such a bad model?

Cheers,
mark




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Greg Smith

Ron Mayer wrote:

I know less about other file systems.  Apparently the NTFS guys
are aware of such stuff - but don't know what kinds of fsync equivalent
you'd need to make it happen.
  


It's actually pretty straightforward--better than ext3.  Windows with 
NTFS has been perfectly aware how to do write-through on drives that 
support it when you execute _commit for some time: 
http://msdn.microsoft.com/en-us/library/17618685(VS.80).aspx


If you switch the postgresql.conf setting to fsync_writethrough on 
Windows, it will execute _commit where it would execute fsync on other 
platforms, and that pushes through the drive's caches as it should 
(unlike fsync in many cases).  More about this at 
http://archives.postgresql.org/pgsql-hackers/2005-08/msg00227.php and 
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm (which 
also covers OS X).
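
For reference, a minimal sketch of what that amounts to from a C program on
Windows (my outline based on the MSDN page above, not PostgreSQL's actual
code; the file name is made up): _commit() on the CRT file descriptor is the
call that pushes the write through.

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* hypothetical file name */
    int fd = _open("commit_test.dat", _O_WRONLY | _O_CREAT | _O_BINARY, _S_IWRITE);
    if (fd < 0) { perror("_open"); return 1; }

    const char buf[] = "durable on return from _commit()";
    if (_write(fd, buf, sizeof buf) < 0) { perror("_write"); return 1; }

    /* _commit() flushes the CRT and OS buffers for this file out to the
     * device -- the call fsync_writethrough maps to on Windows. */
    if (_commit(fd) != 0) { perror("_commit"); return 1; }

    _close(fd);
    return 0;
}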


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] SSD + RAID

2010-02-22 Thread Scott Marlowe
On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith g...@2ndquadrant.com wrote:
 Mark Mielke wrote:

 I had read the above when posted, and then looked up SRAM. SRAM seems to
 suggest it will hold the data even after power loss, but only for a period
 of time. As long as power can restore within a few minutes, it seemed like
 this would be ok?

 The normal type of RAM everyone uses is DRAM, which requires constant
 refresh cycles to keep it working and is pretty power hungry as a result.
  Power gone, data gone an instant later.

Actually, oddly enough, per bit stored dram is much lower power usage
than sram, because it only has something like 2 transistors per bit,
while sram needs something like 4 or 5 (it's been a couple decades
since I took the classes on each).  Even with the constant refresh,
dram has a lower power draw than sram.



Re: [PERFORM] SSD + RAID

2010-02-22 Thread Scott Marlowe
On Mon, Feb 22, 2010 at 7:21 PM, Scott Marlowe scott.marl...@gmail.com wrote:
 On Mon, Feb 22, 2010 at 6:39 PM, Greg Smith g...@2ndquadrant.com wrote:
 Mark Mielke wrote:

 I had read the above when posted, and then looked up SRAM. SRAM seems to
 suggest it will hold the data even after power loss, but only for a period
 of time. As long as power can restore within a few minutes, it seemed like
 this would be ok?

 The normal type of RAM everyone uses is DRAM, which requires constant
 refresh cycles to keep it working and is pretty power hungry as a result.
  Power gone, data gone an instant later.

 Actually, oddly enough, per bit stored dram is much lower power usage
 than sram, because it only has something like 2 transistors per bit,
 while sram needs something like 4 or 5 (it's been a couple decades
 since I took the classes on each).  Even with the constant refresh,
 dram has a lower power draw than sram.

Note that's power draw per bit.  dram is usually much more densely
packed (it can be with fewer transistors per cell) so the individual
chips for each may have similar power draws while the dram will be 10
times as densely packed as the sram.



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Bruce Momjian
Scott Carey wrote:
 On Feb 20, 2010, at 3:19 PM, Bruce Momjian wrote:
 
  Dan Langille wrote:
  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  Bruce Momjian wrote:
  Matthew Wakeling wrote:
  On Fri, 13 Nov 2009, Greg Smith wrote:
  In order for a drive to work reliably for database use such as for
  PostgreSQL, it cannot have a volatile write cache.  You either need a 
  write
  cache with a battery backup (and a UPS doesn't count), or to turn the 
  cache
  off.  The SSD performance figures you've been looking at are with the 
  drive's
  write cache turned on, which means they're completely fictitious and
  exaggerated upwards for your purposes.  In the real world, that will 
  result
  in database corruption after a crash one day.
  Seagate are claiming to be on the ball with this one.
 
  http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
 
  I have updated our documentation to mention that even SSD drives often
  have volatile write-back caches.  Patch attached and applied.
 
  Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
  Do the characteristics of ZFS avoid this issue entirely?
 
  No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
  assumes something sent to the drive is permanent or it would have no way
  to operate.
 
 
 ZFS is write-back cache aware, and safe provided the drive's
 cache flushing and write barrier related commands work.  It will
 flush data in 'transaction groups' and flush the drive write
  caches at the end of those transactions.  Since it's copy on
 write, it can ensure that all the changes in the transaction
 group appear on disk, or all are lost.  This all works so long
 as the cache flush commands do.

Agreed, though I thought the problem was that SSDs lie about their
cache flush like SATA drives do, or is there something I am missing?

--
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Ron Mayer
Bruce Momjian wrote:
 Agreed, though I thought the problem was that SSDs lie about their
 cache flush like SATA drives do, or is there something I am missing?

There's exactly one case I can find[1] where this century's IDE
drives lied more than any other drive with a cache:

  Under 120GB Maxtor drives from late 2003 to early 2004.

and it's apparently been worked around for years.

Those drives claimed to support the FLUSH_CACHE_EXT feature (IDE
command 0xEA), but did not support sending 48-bit commands which
was needed to send the cache flushing command.

And for that case a workaround for Linux was quickly identified by
checking for *both* the support for 48-bit commands and support for the
flush cache extension[2].


Beyond those 2004 drive + 2003 kernel systems, I think most of the rest
of such reports have been various misfeatures in some of Linux's
filesystems (like EXT3 that only wants to send drives cache-flushing
commands when inodes change[3]) and Linux software raid misfeatures

...and ISTM those would affect SSDs the same way they'd affect SATA drives.


[1] http://lkml.org/lkml/2004/5/12/132
[2] http://lkml.org/lkml/2004/5/12/200
[3] http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html





Re: [PERFORM] SSD + RAID

2010-02-21 Thread Greg Smith

Ron Mayer wrote:

Bruce Momjian wrote:
  

Agreed, though I thought the problem was that SSDs lie about their
cache flush like SATA drives do, or is there something I am missing?



There's exactly one case I can find[1] where this century's IDE
drives lied more than any other drive with a cache:


Ron is correct that the problem of mainstream SATA drives accepting the 
cache flush command but not actually doing anything with it is long gone 
at this point.  If you have a regular SATA drive, it almost certainly 
supports proper cache flushing.  And if your whole software/storage 
stack understands all that, you should not end up with corrupted data
just because there's a volatile write cache in there.


But the point of this whole testing exercise coming back into vogue 
again is that SSDs have returned this negligent behavior to the 
mainstream again.  See 
http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion 
of this in a ZFS context just last month.  There are many documented 
cases of Intel SSDs that will fake a cache flush, such that the only way 
to get good reliable writes is to totally disable their write
caches--at which point performance is so bad you might as well have 
gotten a RAID10 setup instead (and longevity is toast too).


This whole area remains a disaster area and extreme distrust of all the 
SSD storage vendors is advisable at this point.  Basically, if I don't 
see the capacitor responsible for flushing outstanding writes, and get a 
clear description from the manufacturer how the cached writes are going 
to be handled in the event of a power failure, at this point I have to 
assume the answer is "badly" and your data will be eaten.  And the
prices for SSDs that meet that requirement are still quite steep.  I 
keep hoping somebody will address this market at something lower than 
the standard enterprise prices.  The upcoming SandForce designs seem 
to have thought this through correctly:  
http://www.anandtech.com/storage/showdoc.aspx?i=3702&p=6  But the
product's not out to the general public yet (just like the Seagate units 
that claim to have capacitor backups--I heard a rumor those are also 
Sandforce designs actually, so they may be the only ones doing this 
right and aiming at a lower price).


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us



Re: [PERFORM] SSD + RAID

2010-02-21 Thread Arjen van der Meijden

On 22-2-2010 6:39 Greg Smith wrote:

But the point of this whole testing exercise coming back into vogue
again is that SSDs have returned this negligent behavior to the
mainstream again. See
http://opensolaris.org/jive/thread.jspa?threadID=121424 for a discussion
of this in a ZFS context just last month. There are many documented
cases of Intel SSDs that will fake a cache flush, such that the only way
to get good reliable writes is to totally disable their write
caches--at which point performance is so bad you might as well have
gotten a RAID10 setup instead (and longevity is toast too).


That's weird. Intel's SSD's didn't have a write cache afaik:
I asked Intel about this and it turns out that the DRAM on the Intel 
drive isn't used for user data because of the risk of data loss, instead 
it is used as memory by the Intel SATA/flash controller for deciding 
exactly where to write data (I'm assuming for the wear 
leveling/reliability algorithms).

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10

But that is the old version, perhaps the second generation does have a 
bit of write caching.


I can understand a SSD might do unexpected things when it loses power 
all of a sudden. It will probably try to group writes to fill a single 
block (and those blocks vary in size but are normally way larger than 
those of a normal spinning disk, they are values like 256 or 512KB) and 
it might lose the data waiting until a full block can be written, or
perhaps it just couldn't complete a full block-write due to the power
failure.
Although that behavior isn't really what you want, it would be incorrect 
to blame write caching for the behavior if the device doesn't even have 
a write cache ;)


Best regards,

Arjen




Re: [PERFORM] SSD + RAID

2010-02-20 Thread Bruce Momjian
Matthew Wakeling wrote:
 On Fri, 13 Nov 2009, Greg Smith wrote:
  In order for a drive to work reliably for database use such as for 
  PostgreSQL, it cannot have a volatile write cache.  You either need a write 
  cache with a battery backup (and a UPS doesn't count), or to turn the cache 
  off.  The SSD performance figures you've been looking at are with the 
  drive's 
  write cache turned on, which means they're completely fictitious and 
  exaggerated upwards for your purposes.  In the real world, that will result 
  in database corruption after a crash one day.
 
 Seagate are claiming to be on the ball with this one.
 
 http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/

I have updated our documentation to mention that even SSD drives often
have volatile write-back caches.  Patch attached and applied.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/wal.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.61
diff -c -c -r1.61 wal.sgml
*** doc/src/sgml/wal.sgml	3 Feb 2010 17:25:06 -	1.61
--- doc/src/sgml/wal.sgml	20 Feb 2010 18:26:40 -
***************
*** 59,65 ****
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure.  To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use
--- 59,66 ----
      same concerns about data loss exist for write-back drive caches as
      exist for disk controller caches.  Consumer-grade IDE and SATA drives are
      particularly likely to have write-back caches that will not survive a
!     power failure.  Many solid-state drives also have volatile write-back
!     caches.  To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
      write caching.  On <productname>FreeBSD</> use

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2010-02-20 Thread Dan Langille
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Bruce Momjian wrote:
 Matthew Wakeling wrote:
 On Fri, 13 Nov 2009, Greg Smith wrote:
 In order for a drive to work reliably for database use such as for 
 PostgreSQL, it cannot have a volatile write cache.  You either need a write 
 cache with a battery backup (and a UPS doesn't count), or to turn the cache 
 off.  The SSD performance figures you've been looking at are with the 
 drive's 
 write cache turned on, which means they're completely fictitious and 
 exaggerated upwards for your purposes.  In the real world, that will result 
 in database corruption after a crash one day.
 Seagate are claiming to be on the ball with this one.

 http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
 
 I have updated our documentation to mention that even SSD drives often
 have volatile write-back caches.  Patch attached and applied.

Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
Do the characteristics of ZFS avoid this issue entirely?

- --
Dan Langille

BSDCan - The Technical BSD Conference : http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference: http://www.pgcon.org/
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.13 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkuAayQACgkQCgsXFM/7nTyMggCgnZUbVzldxjp/nPo8EL1Nq6uG
6+IAoNGIB9x8/mwUQidjM9nnAADRbr9j
=3RJi
-END PGP SIGNATURE-

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2010-02-20 Thread Bruce Momjian
Dan Langille wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 Bruce Momjian wrote:
  Matthew Wakeling wrote:
  On Fri, 13 Nov 2009, Greg Smith wrote:
  In order for a drive to work reliably for database use such as for 
  PostgreSQL, it cannot have a volatile write cache.  You either need a 
  write 
  cache with a battery backup (and a UPS doesn't count), or to turn the 
  cache 
  off.  The SSD performance figures you've been looking at are with the 
  drive's 
  write cache turned on, which means they're completely fictitious and 
  exaggerated upwards for your purposes.  In the real world, that will 
  result 
  in database corruption after a crash one day.
  Seagate are claiming to be on the ball with this one.
 
  http://www.theregister.co.uk/2009/12/08/seagate_pulsar_ssd/
  
  I have updated our documentation to mention that even SSD drives often
  have volatile write-back caches.  Patch attached and applied.
 
 Hmmm.  That got me thinking: consider ZFS and HDD with volatile cache.
 Do the characteristics of ZFS avoid this issue entirely?

No, I don't think so.  ZFS only avoids partial page writes.  ZFS still
assumes something sent to the drive is permanent or it would have no way
to operate.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com
  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-12-03 Thread Scott Carey

On 11/19/09 1:04 PM, Greg Smith g...@2ndquadrant.com wrote:

 That won't help.  Once the checkpoint is done, the problem isn't just
 that the WAL segments are recycled.  The server isn't going to use them
 even if they were there.  The reason why you can erase/recycle them is
 that you're doing so *after* writing out a checkpoint record that says
 you don't have to ever look at them again.  What you'd actually have to
 do is hack the server code to insert that delay after every fsync--there
 are none that you can cheat on and not introduce a corruption
 possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't
 make a lot of assumptions about what the underlying disk has to actually
 do beyond the fsync requirement; the flip side to that robustness is
 that it's the one you can't ever violate safely.

Yeah, I guess it's not so easy.  Having the system hold one extra
checkpoint worth of segments and then during recovery, always replay that
previous one plus the current might work, but I don't know if that could
cause corruption.  I assume replaying a log twice won't, so replaying N-1
checkpoint, then the current one, might work.  If so that would be a cool
feature -- so long as the N-2 checkpoint is no longer in the OS or I/O
hardware caches when checkpoint N completes, you're safe!  Its probably more
complicated though, especially with respect to things like MVCC on DDL
changes.

 Right.  It's not used like the write-cache on a regular hard drive,
 where they're buffering 8MB-32MB worth of writes just to keep seek
 overhead down.  It's there primarily to allow combining writes into
 large chunks, to better match the block size of the underlying SSD flash
 cells (128K).  Having enough space for two full cells allows spooling
 out the flash write to a whole block while continuing to buffer the next
 one.
 
 This is why turning the cache off can tank performance so badly--you're
 going to be writing a whole 128K block no matter what if it's force to
 disk without caching, even if it's just to write a 8K page to it.

As others mentioned, flash must erase a whole block at once, but it can
write sequentially to a block in much smaller chunks.   I believe that MLC
and SLC differ a bit here; SLC can write smaller subsections of the erase 
block.

A little old but still very useful:
http://research.microsoft.com/apps/pubs/?id=63596

 That's only going to reach 1/16 of the usual write speed on single page
 writes.  And that's why you should also be concerned at whether
 disabling the write cache impacts the drive longevity, lots of small
 writes going out in small chunks is going to wear flash out much faster
 than if the drive is allowed to wait until it's got a full sized block
 to write every time.

This is still a concern, since even if the SLC cells are technically capable
of writing sequentially in smaller chunks, with the write cache off they may
not do so.  

 
 The fact that the cache is so small is also why it's harder to catch the
 drive doing the wrong thing here.  The plug test is pretty sensitive to
 a problem when you've got megabytes worth of cached writes that are
 spooling to disk at spinning hard drive speeds.  The window for loss on
 a SSD with no seek overhead and only a moderate number of KB worth of
 cached data is much, much smaller.  Doesn't mean it's gone though.  It's
 a shame that the design wasn't improved just a little bit; a cheap
 capacitor and blocking new writes once the incoming power dropped is all
 it would take to make these much more reliable for database use.  But
 that would raise the price, and not really help anybody but the small
 subset of the market that cares about durable writes.

Yup.  There are manufacturers who claim no data loss on power failure,
hopefully these become more common.
http://www.wdc.com/en/products/ssd/technology.asp?id=1

I still contend it's a lot safer than a hard drive.  I have not seen one
fail yet (out of about 150 heavy use drive-years on X25-Ms).  Any system
that does not have a battery backed write cache will be faster and safer if
an SSD, with write cache on, than hard drives with write cache on.

BBU caching is not fail-safe either, batteries wear out, cards die or
malfunction.
If you need the maximum data integrity, you will probably go with a
battery-backed cache raid setup with or without SSDs.  If you don't go that
route SSD's seem like the best option.  The 'middle ground' of software raid
with hard drives with their write caches off doesn't seem useful to me at
all.  I can't think of one use case that isn't better served by a slightly
cheaper array of disks with a hardware bbu card (if the data is important or
data size is large) OR a set of SSD's (if performance is more important than
data safety). 

 4: Yet another solution:  The drives DO adhere to write barriers properly.
 A filesystem that used these in the process of fsync() would be fine too.
 So XFS without LVM or MD (or the newer versions of those that don't ignore barriers) would work too.

Re: [PERFORM] SSD + RAID

2009-11-30 Thread Bruce Momjian
Greg Smith wrote:
 Bruce Momjian wrote:
  I thought our only problem was testing the I/O subsystem --- I never
  suspected the file system might lie too.  That email indicates that a
  large percentage of our install base is running on unreliable file
  systems --- why have I not heard about this before?  Do the write
  barriers allow data loss but prevent data inconsistency?  It sounds like
  they are effectively running with synchronous_commit = off.

 You might occasionally catch me ranting here that Linux write barriers 
 are not a useful solution at all for PostgreSQL, and that you must turn 
 the disk write cache off rather than expect the barrier implementation 
 to do the right thing.  This sort of buginess is why.  The reason why it 
 doesn't bite more people is that most Linux systems don't turn on write 
 barrier support by default, and there's a number of situations that can 
 disable barriers even if you did try to enable them.  It's still pretty 
 unusual to have a working system with barriers turned on nowadays; I 
 really doubt it's a large percentage of our install base.

Ah, so it is only when write barriers are enabled, and they are not
enabled by default --- OK, that makes sense.

 I've started keeping most of my notes about where ext3 is vulnerable to 
 issues in Wikipedia, specifically
 http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just 
 updated that section to point out the specific issue Ron pointed out.  
 Maybe we should point people toward that in the docs, I try to keep that 
 article correct.

Yes, good idea.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-30 Thread Ron Mayer
Bruce Momjian wrote:
 For example, ext3 fsync() will issue write barrier commands
 if the inode was modified; but not if the inode wasn't.

 See test program here:
 http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
 and read two paragraphs further to see how touching
 the inode makes ext3 fsync behave differently.
 
 I thought our only problem was testing the I/O subsystem --- I never
 suspected the file system might lie too.  That email indicates that a
 large percentage of our install base is running on unreliable file
 systems --- why have I not heard about this before?  

It came up on these lists a few times in the past.  Here's one example.
http://archives.postgresql.org/pgsql-performance/2008-08/msg00159.php

As far as I can tell, most of the threads ended with people still
suspecting lying hard drives.  But to the best of my ability I can't
find any drives that actually lie when sent the commands to flush
their caches.  But various combinations of ext3 & linux MD will
decide not to send IDE FLUSH_CACHE_EXT (nor the similar
SCSI SYNCHRONIZE CACHE command) under various situations.

I wonder if there are enough ext3 users out there that postgres should
touch the inodes before doing a fsync.
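
A rough sketch of what such a workaround could look like in application code
(illustrative only, not PostgreSQL code -- the helper name and the 0644/0664
modes are made up, reusing the fchmod trick from the test program further
down the thread):

#include <sys/stat.h>
#include <unistd.h>

/* Dirty the inode before fsync() so that ext3's journal commit also
   issues the drive cache flush (write barrier). */
static int fsync_with_inode_touch(int fd)
{
	fchmod (fd, 0644);	/* toggling the mode marks the inode dirty */
	fchmod (fd, 0664);
	return fsync (fd);
}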

 Do the write barriers allow data loss but prevent data inconsistency?  

If I understand right, data inconsistency could occur too.  One
aspect of the write barriers is flushing a hard drive's caches.

 It sounds like they are effectively running with synchronous_commit = off.

And with the (mythical?) hard drive with lying caches.



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-30 Thread Ron Mayer
Bruce Momjian wrote:
 Greg Smith wrote:
 Bruce Momjian wrote:
 I thought our only problem was testing the I/O subsystem --- I never
 suspected the file system might lie too.  That email indicates that a
 large percentage of our install base is running on unreliable file
 systems --- why have I not heard about this before?
   
  The reason why it 
 doesn't bite more people is that most Linux systems don't turn on write 
 barrier support by default, and there's a number of situations that can 
 disable barriers even if you did try to enable them.  It's still pretty 
 unusual to have a working system with barriers turned on nowadays; I 
 really doubt it's a large percentage of our install base.
 
 Ah, so it is only when write barriers are enabled, and they are not
 enabled by default --- OK, that makes sense.

The test program I linked up-thread shows that fsync does nothing
unless the inode's touched on an out-of-the-box Ubuntu 9.10 using
ext3 on a straight from Dell system.

Surely that's a common config, no?

If I uncomment the fchmod lines below I can see that even with ext3
and write caches enabled on my drives it does indeed wait.
Note that EXT4 doesn't show the problem on the same system.

Here's a slightly modified test program that's a bit easier to run.
If you run the program and it exits right away, your system isn't
waiting for platters to spin.


/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
** If this program returns instantly, the fsync() lied.
** If it takes a second or so, fsync() probably works.
** On ext3 and drives that cache writes, you probably need
** to uncomment the fchmod's to make fsync work right.
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc,char *argv[]) {
  if (argc<2) {
    printf("usage: fs filename\n");
    exit(1);
  }
  int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
  int i;
  for (i=0;i<100;i++) {
    char byte;
    pwrite (fd, &byte, 1, 0);
    // fchmod (fd, 0644); fchmod (fd, 0664);
    fsync (fd);
  }
}

r...@ron-desktop:/tmp$ /usr/bin/time ./a.out foo
0.00user 0.00system 0:00.01elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-30 Thread Bruce Momjian
Ron Mayer wrote:
 Bruce Momjian wrote:
  Greg Smith wrote:
  Bruce Momjian wrote:
  I thought our only problem was testing the I/O subsystem --- I never
  suspected the file system might lie too.  That email indicates that a
  large percentage of our install base is running on unreliable file
  systems --- why have I not heard about this before?

  The reason why it 
  doesn't bite more people is that most Linux systems don't turn on write 
  barrier support by default, and there's a number of situations that can 
  disable barriers even if you did try to enable them.  It's still pretty 
  unusual to have a working system with barriers turned on nowadays; I 
  really doubt it's a large percentage of our install base.
  
  Ah, so it is only when write barriers are enabled, and they are not
  enabled by default --- OK, that makes sense.
 
 The test program I linked up-thread shows that fsync does nothing
 unless the inode's touched on an out-of-the-box Ubuntu 9.10 using
 ext3 on a straight from Dell system.
 
 Surely that's a common config, no?

Yea, this certainly suggests that the problem is wide-spread.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-29 Thread Ron Mayer
Bruce Momjian wrote:
 Greg Smith wrote:
 A good test program that is a bit better at introducing and detecting 
 the write cache issue is described at 
 http://brad.livejournal.com/2116715.html
 
 Wow, I had not seen that tool before.  I have added a link to it from
 our documentation, and also added a mention of our src/tools/fsync test
 tool to our docs.

One challenge with many of these test programs is that some
filesystems (ext3 is one) will flush drive caches on fsync()
*sometimes*, but not always.   If your test program happens to do
a sequence of commands that makes an fsync() actually flush a
disk's caches, it might mislead you if your actual application
has a different series of system calls.

For example, ext3 fsync() will issue write barrier commands
if the inode was modified; but not if the inode wasn't.

See test program here:
http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
and read two paragraphs further to see how touching
the inode makes ext3 fsync behave differently.




-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-29 Thread Bruce Momjian
Ron Mayer wrote:
 Bruce Momjian wrote:
  Greg Smith wrote:
  A good test program that is a bit better at introducing and detecting 
  the write cache issue is described at 
  http://brad.livejournal.com/2116715.html
  
  Wow, I had not seen that tool before.  I have added a link to it from
  our documentation, and also added a mention of our src/tools/fsync test
  tool to our docs.
 
 One challenge with many of these test programs is that some
 filesystems (ext3 is one) will flush drive caches on fsync()
 *sometimes*, but not always.   If your test program happens to do
 a sequence of commands that makes an fsync() actually flush a
 disk's caches, it might mislead you if your actual application
 has a different series of system calls.
 
 For example, ext3 fsync() will issue write barrier commands
 if the inode was modified; but not if the inode wasn't.
 
 See test program here:
 http://www.mail-archive.com/linux-ker...@vger.kernel.org/msg272253.html
 and read two paragraphs further to see how touching
 the inode makes ext3 fsync behave differently.

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too.  That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before?  Do the write
barriers allow data loss but prevent data inconsistency?  It sounds like
they are effectively running with synchronous_commit = off.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-29 Thread Greg Smith

Bruce Momjian wrote:

I thought our only problem was testing the I/O subsystem --- I never
suspected the file system might lie too.  That email indicates that a
large percentage of our install base is running on unreliable file
systems --- why have I not heard about this before?  Do the write
barriers allow data loss but prevent data inconsistency?  It sounds like
they are effectively running with synchronous_commit = off.
  
You might occasionally catch me ranting here that Linux write barriers 
are not a useful solution at all for PostgreSQL, and that you must turn 
the disk write cache off rather than expect the barrier implementation 
to do the right thing.  This sort of buginess is why.  The reason why it 
doesn't bite more people is that most Linux systems don't turn on write 
barrier support by default, and there's a number of situations that can 
disable barriers even if you did try to enable them.  It's still pretty 
unusual to have a working system with barriers turned on nowadays; I 
really doubt it's a large percentage of our install base.


I've started keeping most of my notes about where ext3 is vulnerable to 
issues in Wikipedia, specifically
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal ; I just 
updated that section to point out the specific issue Ron pointed out.  
Maybe we should point people toward that in the docs, I try to keep that 
article correct.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-28 Thread Bruce Momjian
Greg Smith wrote:
 Merlin Moncure wrote:
  I am right now talking to someone on postgresql irc who is measuring
  15k iops from x25-e and no data loss following power plug test.
 The funny thing about Murphy is that he doesn't visit when things are 
 quiet.  It's quite possible the window for data loss on the drive is 
 very small.  Maybe you only see it one out of 10 pulls with a very 
 aggressive database-oriented write test.  Whatever the odd conditions 
 are, you can be sure you'll see them when there's a bad outage in actual 
 production though.
 
 A good test program that is a bit better at introducing and detecting 
 the write cache issue is described at 
 http://brad.livejournal.com/2116715.html

Wow, I had not seen that tool before.  I have added a link to it from
our documentation, and also added a mention of our src/tools/fsync test
tool to our docs.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/config.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/config.sgml,v
retrieving revision 1.233
diff -c -c -r1.233 config.sgml
*** doc/src/sgml/config.sgml	13 Nov 2009 22:43:39 -0000	1.233
--- doc/src/sgml/config.sgml	28 Nov 2009 16:12:46 -0000
***************
*** 1432,1437 ****
--- 1432,1439 ----
      The default is the first method in the above list that is supported
      by the platform.
      The <literal>open_</>* options also use <literal>O_DIRECT</> if available.
+     The utility <filename>src/tools/fsync</> in the PostgreSQL source tree
+     can do performance testing of various fsync methods.
      This parameter can only be set in the <filename>postgresql.conf</>
      file or on the server command line.
     </para>
Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.59
diff -c -c -r1.59 wal.sgml
*** doc/src/sgml/wal.sgml	9 Apr 2009 16:20:50 -0000	1.59
--- doc/src/sgml/wal.sgml	28 Nov 2009 16:12:57 -0000
***************
*** 86,91 ****
--- 86,93 ----
     ensure data integrity.  Avoid disk controllers that have non-battery-backed
     write caches.  At the drive level, disable write-back caching if the
     drive cannot guarantee the data will be written before shutdown.
+    You can test for reliable I/O subsystem behavior using <ulink
+    url="http://brad.livejournal.com/2116715.html">diskchecker.pl</ulink>.
    </para>
 
    <para>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-21 Thread Merlin Moncure
On Fri, Nov 20, 2009 at 7:27 PM, Greg Smith g...@2ndquadrant.com wrote:
 Richard Neill wrote:

 The key issue for short,fast transactions seems to be
 how fast an fdatasync() call can run, forcing the commit to disk, and
 allowing the transaction to return to userspace.
 Attached is a short C program which may be of use.

 Right.  I call this the commit rate of the storage, and on traditional
 spinning disks it's slightly below the rotation speed of the media (i.e.
 7200RPM = 120 commits/second).    If you've got a battery-backed cache in
 front of standard disks, you can easily clear 10K commits/second.


...until you overflow the cache.  Battery-backed cache does not break
the laws of physics...it just provides a higher burst rate (plus whatever
advantages can be gained by peeking into the write queue and
re-arranging/grouping writes).  I learned the hard way that how your raid
controller behaves in overflow situations can cause catastrophic
performance degradations...

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-20 Thread Axel Rau


Am 13.11.2009 um 14:57 schrieb Laszlo Nagy:

I was thinking about ARECA 1320 with 2GB memory + BBU.  
Unfortunately, I cannot find information about using ARECA cards  
with SSD drives.
They told me: currently not supported, but they have positive customer  
reports. No date yet for implementation of the TRIM command in firmware.

...
My other option is to buy two SLC SSD drives and use RAID1. It would  
cost about the same, but has less redundancy and less capacity.  
Which is the faster? 8-10 MLC disks in RAID 6 with a good caching  
controller, or two SLC disks in RAID1?

I just went the MLC path with X25-Ms mainly to save energy.
The freshly assembled box has one SSD for WAL and one RAID 0 with four
SSDs as table space.
Everything runs smoothly on an Areca 1222 with BBU, which turned all
write caches off.

OS is FreeBSD 8.0. I aligned all partitions on 1 MB boundaries.
Next week I will install 8.4.1 and run pgbench for pull-the-plug-testing.
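
One crude way to do such a pull-the-plug test, sketched below in the spirit of
diskchecker.pl (this is only an illustration, not a substitute for the real
tool; the file name and record format are arbitrary): run it over ssh from a
second machine so the last value reported as synced survives the power loss,
pull the plug, and after reboot check that the counter stored in the file is
at least that value.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long counter = 0;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s testfile\n", argv[0]);
		exit(1);
	}
	fd = open(argv[1], O_WRONLY | O_CREAT, 0666);
	if (fd < 0) {
		perror("open");
		exit(1);
	}
	for (;;) {
		if (pwrite(fd, &counter, sizeof(counter), 0) != sizeof(counter)
		    || fdatasync(fd) != 0) {
			perror("write/fdatasync");
			exit(1);
		}
		/* only claim the value is durable after fdatasync() returned */
		printf("synced %lu\n", counter);
		fflush(stdout);
		counter++;
	}
}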


I would like to get some advice from the list for testing the SSDs!

Axel
---
axel@chaos1.de  PGP-Key:29E99DD6  +49 151 2300 9283  computing @  
chaos claudius











--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-20 Thread Matthew Wakeling

On Thu, 19 Nov 2009, Greg Smith wrote:
This is why turning the cache off can tank performance so badly--you're going 
to be writing a whole 128K block no matter what if it's forced to disk without 
caching, even if it's just to write an 8K page to it.


Theoretically, this does not need to be the case. Now, I don't know what 
the Intel drives actually do, but remember that for flash, it is the 
*erase* cycle that has to be done in large blocks. Writing itself can be 
done in small blocks, to previously erased sites.


The technology for combining small writes into sequential writes has been 
around for 17 years or so in 
http://portal.acm.org/citation.cfm?id=146943&dl= so there really isn't any 
excuse for modern flash drives not giving really fast small writes.


Matthew

--
for a in past present future; do
  for b in clients employers associates relatives neighbours pets; do
  echo The opinions here in no way reflect the opinions of my $a $b.
done; done

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-20 Thread Jeff Janes
On Wed, Nov 18, 2009 at 8:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Scott Carey sc...@richrelevance.com writes:
 For your database DATA disks, leaving the write cache on is 100% acceptable,
 even with power loss, and without a RAID controller.  And even in high write
 environments.

 Really?  How hard have you tested that configuration?

 That is what the XLOG is for, isn't it?

 Once we have fsync'd a data change, we discard the relevant XLOG
 entries.  If the disk hasn't actually put the data on stable storage
 before it claims the fsync is done, you're screwed.

 XLOG only exists to centralize the writes that have to happen before
 a transaction can be reported committed (in particular, to avoid a
 lot of random-access writes at commit).  It doesn't make any
 fundamental change in the rules of the game: a disk that lies about
 write complete will still burn you.

 In a zero-seek-cost environment I suspect that XLOG wouldn't actually
 be all that useful.

You would still need it to guard against partial page writes, unless
we have some guarantee that those can't happen.

And once your transaction has scattered its transaction id into
various xmin and xmax over many tables, you need an atomic, durable
repository to decide if that id has or has not committed.  Maybe clog
fsynced on commit would serve this purpose?

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-20 Thread Richard Neill

Axel Rau wrote:


Am 13.11.2009 um 14:57 schrieb Laszlo Nagy:

I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately, 
I cannot find information about using ARECA cards with SSD drives.
They told me: currently not supported, but they have positive customer 
reports. No date yet for implementation of the TRIM command in firmware.

...
My other option is to buy two SLC SSD drives and use RAID1. It would 
cost about the same, but has less redundancy and less capacity. Which 
is the faster? 8-10 MLC disks in RAID 6 with a good caching 
controller, or two SLC disks in RAID1?


Despite my other problems, I've found that the Intel X25-Es work
remarkably well. The key issue for short,fast transactions seems to be
how fast an fdatasync() call can run, forcing the commit to disk, and
allowing the transaction to return to userspace.
With all the caches off, the intel X25-E beat a standard disk by a
factor of about 10.
Attached is a short C program which may be of use.


For what it's worth, we have actually got a pretty decent (and
redundant) setup using a RAIS array of RAID1.


[primary server]

SSD }
    }  RAID1 ---}  DRBD --- /var/lib/postgresql
SSD }              }
                   }
                   }  (gigE, back-to-back)
                   }
[secondary server] }
                   }
SSD }              }
    }  RAID1 ------}
SSD }



The servers connect back-to-back with a dedicated Gigabit ethernet
cable, and DRBD is running in protocol B.

We can pull the power out of 1 server, and be using the next within 30
seconds, and with no dataloss.


Richard



#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NUM_ITER 1024

int main ( int argc, char **argv ) {
	const char data[] = "Liberate";
	size_t data_len = strlen ( data );
	const char *filename;
	int fd;
	unsigned int i;

	if ( argc != 2 ) {
		fprintf ( stderr, "Syntax: %s output_file\n", argv[0] );
		exit ( 1 );
	}
	filename = argv[1];
	fd = open ( filename, ( O_WRONLY | O_CREAT | O_EXCL ), 0666 );
	if ( fd < 0 ) {
		fprintf ( stderr, "Could not create \"%s\": %s\n",
			  filename, strerror ( errno ) );
		exit ( 1 );
	}

	for ( i = 0 ; i < NUM_ITER ; i++ ) {
		if ( write ( fd, data, data_len ) != data_len ) {
			fprintf ( stderr, "Could not write: %s\n",
				  strerror ( errno ) );
			exit ( 1 );
		}
		if ( fdatasync ( fd ) != 0 ) {
			fprintf ( stderr, "Could not fdatasync: %s\n",
				  strerror ( errno ) );
			exit ( 1 );
		}
	}
	return 0;
}


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-20 Thread Greg Smith

Richard Neill wrote:

The key issue for short,fast transactions seems to be
how fast an fdatasync() call can run, forcing the commit to disk, and
allowing the transaction to return to userspace.
Attached is a short C program which may be of use.
Right.  I call this the commit rate of the storage, and on traditional 
spinning disks it's slightly below the rotation speed of the media (i.e. 
7200RPM = 120 commits/second).  If you've got a battery-backed cache 
in front of standard disks, you can easily clear 10K commits/second.


I normally test that out with sysbench, because I use that for some 
other tests anyway:


sysbench --test=fileio --file-fsync-freq=1 --file-num=1 
--file-total-size=16384 --file-test-mode=rndwr run | grep Requests/sec


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Craig Ringer
On 19/11/2009 12:22 PM, Scott Carey wrote:

 3:  Have PG wait a half second (configurable) after the checkpoint fsync()
 completes before deleting/ overwriting any WAL segments.  This would be a
 trivial feature to add to a postgres release, I think.

How does that help? It doesn't provide any guarantee that the data has
hit main storage - it could lurk in SSD cache for hours.

 4: Yet another solution:  The drives DO adhere to write barriers properly.
 A filesystem that used these in the process of fsync() would be fine too.
 So XFS without LVM or MD (or the newer versions of those that don't ignore
 barriers) would work too.

*if* the WAL is also on the SSD.

If the WAL is on a separate drive, the write barriers do you no good,
because they won't ensure that the data hits the main drive storage
before the WAL recycling hits the WAL disk storage. The two drives
operate independently and the write barriers don't interact.

You'd need some kind of inter-drive write barrier.

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

For your database DATA disks, leaving the write cache on is 100% acceptable,
even with power loss, and without a RAID controller.  And even in high write
environments.

That is what the XLOG is for, isn't it?  That is where this behavior is
critical.  But that has completely different performance requirements and
need not be on the same volume, array, or drive.
  
At checkpoint time, writes to the main data files are done that are 
followed by fsync calls to make sure those blocks have been written to 
disk.  Those writes have exactly the same consistency requirements as 
the more frequent pg_xlog writes.  If the drive ACKs the write, but it's 
not on physical disk yet, it's possible for the checkpoint to finish and 
the underlying pg_xlog segments needed to recover from a crash at that 
point to be deleted.  The end of the checkpoint can wipe out many WAL 
segments, presuming they're not needed anymore because the data blocks 
they were intended to fix during recovery are now guaranteed to be on disk.
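
Sketched very roughly (stub functions only, for illustration; this is not the
actual PostgreSQL checkpoint code), the ordering dependency is:

/* Steps 3 and 4 are only safe if the fsyncs in step 2 really reached
   stable storage rather than a volatile write cache. */
static void write_dirty_data_pages(void)            { /* 1. push dirty buffers to the data files */ }
static void fsync_all_touched_data_files(void)      { /* 2. must hit the platter/flash           */ }
static void write_and_fsync_checkpoint_record(void) { /* 3. declares older WAL unnecessary       */ }
static void recycle_old_wal_segments(void)          { /* 4. data files are now the only copy     */ }

static void checkpoint_sketch(void)
{
	write_dirty_data_pages();
	fsync_all_touched_data_files();
	write_and_fsync_checkpoint_record();
	recycle_old_wal_segments();
}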


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Karl Denninger
Greg Smith wrote:
 Scott Carey wrote:
 For your database DATA disks, leaving the write cache on is 100%
 acceptable,
 even with power loss, and without a RAID controller.  And even in
 high write
 environments.

 That is what the XLOG is for, isn't it?  That is where this behavior is
 critical.  But that has completely different performance requirements
 and
 need not be on the same volume, array, or drive.
   
 At checkpoint time, writes to the main data files are done that are
 followed by fsync calls to make sure those blocks have been written to
 disk.  Those writes have exactly the same consistency requirements as
 the more frequent pg_xlog writes.  If the drive ACKs the write, but
 it's not on physical disk yet, it's possible for the checkpoint to
 finish and the underlying pg_xlog segments needed to recover from a
 crash at that point to be deleted.  The end of the checkpoint can wipe
 out many WAL segments, presuming they're not needed anymore because
 the data blocks they were intended to fix during recovery are now
 guaranteed to be on disk.
Guys, read that again.

IF THE DISK OR DRIVER ACK'S A FSYNC CALL THE WAL ENTRY IS LIKELY GONE,
AND YOU ARE SCREWED IF THE DATA IS NOT REALLY ON THE DISK.

-- Karl
attachment: karl.vcf
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL?
  
I think the point of the paranoia in this thread is that if you're 
introducing a component with a known risk in it, you're really asking 
for trouble because (as you point out) it's hard enough to keep a system 
running just through the unexpected ones that shouldn't have happened at 
all.  No need to make that even harder by introducing something that is 
*known* to fail under some conditions.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Scott Marlowe
On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Nov 18, 2009 at 11:39 PM, Scott Carey sc...@richrelevance.com wrote:
 Well, that is sort of true for all benchmarks, but I do find that bonnie++
 is the worst of the bunch.  I consider it relatively useless compared to
 fio.  Its just not a great benchmark for server type load and I find it
 lacking in the ability to simulate real applications.

 I agree.   My biggest gripe with bonnie actually is that 99% of the
 time is spent measuring in sequential tests which is not that
 important in the database world.  Dedicated wal volume uses ostensibly
 sequential io, but it's fairly difficult to outrun a dedicated wal
 volume even if it's on a vanilla sata drive.

 pgbench is actually a pretty awesome i/o tester assuming you have big
 enough scaling factor, because:
 a) it's much closer to the environment you will actually run in
 b) you get to see what i/o affecting options have on the load
 c) you have broad array of options regarding what gets done (select
 only, -f, etc)
 d) once you build the test database, you can do multiple runs without
 rebuilding it

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Anton Rommerskirchen
Am Donnerstag, 19. November 2009 13:29:56 schrieb Craig Ringer:
 On 19/11/2009 12:22 PM, Scott Carey wrote:
  3:  Have PG wait a half second (configurable) after the checkpoint
  fsync() completes before deleting/ overwriting any WAL segments.  This
  would be a trivial feature to add to a postgres release, I think.

 How does that help? It doesn't provide any guarantee that the data has
 hit main storage - it could lurk in SSD cache for hours.

  4: Yet another solution:  The drives DO adhere to write barriers
  properly. A filesystem that used these in the process of fsync() would be
  fine too. So XFS without LVM or MD (or the newer versions of those that
  don't ignore barriers) would work too.

 *if* the WAL is also on the SSD.

 If the WAL is on a separate drive, the write barriers do you no good,
 because they won't ensure that the data hits the main drive storage
 before the WAL recycling hits the WAL disk storage. The two drives
 operate independently and the write barriers don't interact.

 You'd need some kind of inter-drive write barrier.

 --
 Craig Ringer


Hello!

as I understand this:
SSD performance is great, but caching is the problem.

questions:

1. what about conventional disks with 32/64 MB cache? how do they handle the 
plug test if their caches are on?

2. what about using a separate power supply for the disks? is it possible to 
write back the cache after switching the SATA drive to another machine/controller?

3. what about making a statement about a lacking enterprise feature (aka an 
emergency-battery-equipped SSD) and submitting this to the producers?

I found that one of them (OCZ) seems to handle suggestions from customers (see 
the write speed discussions on the Vertex, for example)

and another (Intel) seems to handle serious problems with its disks by 
rewriting and sometimes redesigning its products - if you tell them and the 
market dictates a reaction (see the degradation of performance before the 1.11 
firmware).

perhaps it's time to act and not only complain about the fact.

(btw: I got funny bonnie++ results for my Intel 160 GB Postville and my Samsung 
PB22 after using them for approx. 3 months now ... my conclusion: NOT all SSDs 
are equal ...)

best regards 

anton

-- 

ATRSoft GmbH
Bivetsweg 12
D 41542 Dormagen
Deutschland
Tel .: +49(0)2182 8339951
Mobil: +49(0)172 3490817

Geschäftsführer Anton Rommerskirchen

Köln HRB 44927
STNR 122/5701 - 2030
USTID DE213791450

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Brad Nicholson
On Thu, 2009-11-19 at 19:01 +0100, Anton Rommerskirchen wrote:
 Am Donnerstag, 19. November 2009 13:29:56 schrieb Craig Ringer:
  On 19/11/2009 12:22 PM, Scott Carey wrote:
   3:  Have PG wait a half second (configurable) after the checkpoint
   fsync() completes before deleting/ overwriting any WAL segments.  This
   would be a trivial feature to add to a postgres release, I think.
 
  How does that help? It doesn't provide any guarantee that the data has
  hit main storage - it could lurk in SSD cache for hours.
 
   4: Yet another solution:  The drives DO adhere to write barriers
   properly. A filesystem that used these in the process of fsync() would be
   fine too. So XFS without LVM or MD (or the newer versions of those that
   don't ignore barriers) would work too.
 
  *if* the WAL is also on the SSD.
 
  If the WAL is on a separate drive, the write barriers do you no good,
  because they won't ensure that the data hits the main drive storage
  before the WAL recycling hits the WAL disk storage. The two drives
  operate independently and the write barriers don't interact.
 
  You'd need some kind of inter-drive write barrier.
 
  --
  Craig Ringer
 
 
 Hello!
 
 as I understand this:
 SSD performance is great, but caching is the problem.
 
 questions:
 
 1. what about conventional disks with 32/64 MB cache? how do they handle the 
 plug test if their caches are on?

If they aren't battery-backed, they can lose data.  This is not specific
to SSDs.

 2. what about using a separate power supply for the disks? is it possible to 
 write back the cache after switching the SATA drive to another machine/controller?

Not sure.  I only use devices with battery-backed caches or no cache.  I
would be concerned however about the drive not flushing itself and still
running out of power.

 3. what about making a statement about a lacking enterprise feature (aka an 
 emergency-battery-equipped SSD) and submitting this to the producers?

The producers aren't making Enterprise products, they are using caches
to accelerate the speeds of consumer products to make their drives more
appealing to consumers.  They aren't going to slow them down to make
them more reliable, especially when the core consumer doesn't know about
this issue, and is even less likely to understand it if explained.

They may stamp the word Enterprise on them, but it's nothing more than
marketing.

 I found that one of them (OCZ) seems to handle suggestions from customers (see 
 the write speed discussions on the Vertex, for example)
 
 and another (Intel) seems to handle serious problems with its disks by 
 rewriting and sometimes redesigning its products - if you tell them and the 
 market dictates a reaction (see the degradation of performance before the 1.11 
 firmware).
 
 perhaps it's time to act and not only complain about the fact.

Or, you could just buy higher quality equipment that was designed with
this in mind.

There is nothing unique to SSD here IMHO.  I wouldn't run my production
grade databases on consumer grade HDD, I wouldn't run them on consumer
grade SSD either.


-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Carey wrote:

Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/ overwriting any WAL segments.  This would be a
trivial feature to add to a postgres release, I think.  Actually, it
already exists!  Turn on log archiving, and have the script that it runs after 
a checkpoint sleep().
  
That won't help.  Once the checkpoint is done, the problem isn't just 
that the WAL segments are recycled.  The server isn't going to use them 
even if they were there.  The reason why you can erase/recycle them is 
that you're doing so *after* writing out a checkpoint record that says 
you don't have to ever look at them again.  What you'd actually have to 
do is hack the server code to insert that delay after every fsync--there 
are none that you can cheat on and not introduce a corruption 
possibility.  The whole WAL/recovery mechanism in PostgreSQL doesn't 
make a lot of assumptions about what the underlying disk has to actually 
do beyond the fsync requirement; the flip side to that robustness is 
that it's the one you can't ever violate safely.

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
space).
  
Right.  It's not used like the write-cache on a regular hard drive, 
where they're buffering 8MB-32MB worth of writes just to keep seek 
overhead down.  It's there primarily to allow combining writes into 
large chunks, to better match the block size of the underlying SSD flash 
cells (128K).  Having enough space for two full cells allows spooling 
out the flash write to a whole block while continuing to buffer the next 
one.


This is why turning the cache off can tank performance so badly--you're 
going to be writing a whole 128K block no matter what if it's forced to 
disk without caching, even if it's just to write an 8K page to it.  
That's only going to reach 1/16 of the usual write speed on single page 
writes.  And that's why you should also be concerned at whether 
disabling the write cache impacts the drive longevity, lots of small 
writes going out in small chunks is going to wear flash out much faster 
than if the drive is allowed to wait until it's got a full sized block 
to write every time.
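
Back-of-the-envelope, using the 128K cell and 8K page sizes mentioned above
(the 70MB/s sequential figure below is only an assumed example, not a
measurement):

#include <stdio.h>

int main(void)
{
	const double cell_kb  = 128.0;	/* flash block written per forced write    */
	const double page_kb  = 8.0;	/* useful data per synchronous page write  */
	const double seq_mb_s = 70.0;	/* assumed full-block sequential speed     */

	printf("useful fraction per write: %.4f\n", page_kb / cell_kb);
	printf("effective page-write rate: %.1f MB/s\n", seq_mb_s * page_kb / cell_kb);
	printf("flash wear multiplier:     %.0fx\n", cell_kb / page_kb);
	return 0;
}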


The fact that the cache is so small is also why it's harder to catch the 
drive doing the wrong thing here.  The plug test is pretty sensitive to 
a problem when you've got megabytes worth of cached writes that are 
spooling to disk at spinning hard drive speeds.  The window for loss on 
a SSD with no seek overhead and only a moderate number of KB worth of 
cached data is much, much smaller.  Doesn't mean it's gone though.  It's 
a shame that the design wasn't improved just a little bit; a cheap 
capacitor and blocking new writes once the incoming power dropped is all 
it would take to make these much more reliable for database use.  But 
that would raise the price, and not really help anybody but the small 
subset of the market that cares about durable writes.

4: Yet another solution:  The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.
  
If I really trusted anything beyond the very basics of the filesystem to 
really work well on Linux, this whole issue would be moot for most of 
the production deployments I do.  Ideally, fsync would just push out the 
minimum of what's needed, it would call the appropriate write cache 
flush mechanism the way the barrier implementation does when that all 
works, life would be good.  Alternately, you might even switch to using 
O_SYNC writes instead, which on a good filesystem implementation are 
both accelerated and safe compared to write/fsync (I've seen that work 
as expected on Veritas VxFS for example). 

Meanwhile, in the actual world we live, patches that make writes more 
durable by default are dropped by the Linux community because they tank 
performance for too many types of loads, I'm frightened to turn on 
O_SYNC at all on ext3 because of reports of corruption on the lists 
here, fsync does way more work than it needs to, and the way the 
filesystem and block drivers have been separated makes it difficult to 
do any sort of device write cache control from userland.  This is why I 
try to use the simplest, best tested approach out there whenever possible.
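
For reference, the two code paths being compared look roughly like this (a
minimal sketch with error handling omitted; whether the O_SYNC variant is
actually both safe and fast depends entirely on the filesystem, as noted):

#include <fcntl.h>
#include <unistd.h>

/* write() then fsync(): the fsync may flush far more than this one write. */
static void write_then_fsync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0666);
	(void) write(fd, buf, len);
	fsync(fd);
	close(fd);
}

/* O_SYNC: each write() returns only once the data is durable, provided
   the filesystem implements O_SYNC honestly. */
static void write_osync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0666);
	(void) write(fd, buf, len);
	close(fd);
}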


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Greg Smith

Scott Marlowe wrote:

On Thu, Nov 19, 2009 at 10:01 AM, Merlin Moncure mmonc...@gmail.com wrote:
  

pgbench is actually a pretty awesome i/o tester assuming you have big
enough scaling factor

Seeing as how pgbench only goes to a scaling factor of 4000, are there any
plans on enlarging that number?
  
I'm doing pgbench tests now on a system large enough for this limit to 
matter, so I'm probably going to have to fix that for 8.5 just to 
complete my own work.


You can use pgbench to either get interesting peak read results, or peak 
write ones, but it's not real useful for things in between.  The 
standard test basically turns into a huge stack of writes to a single 
table, and the select-only one is interesting to gauge either cached or 
uncached read speed (depending on the scale).  It's not very useful for 
getting a feel for how something with a mixed read/write workload does 
though, which is unfortunate because I think that scenario is much more 
common than what it does test.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Merlin Moncure
On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith g...@2ndquadrant.com wrote:
 You can use pgbench to either get interesting peak read results, or peak
 write ones, but it's not real useful for things in between.  The standard
 test basically turns into a huge stack of writes to a single table, and the
 select-only one is interesting to gauge either cached or uncached read speed
 (depending on the scale).  It's not very useful for getting a feel for how
 something with a mixed read/write workload does though, which is unfortunate
 because I think that scenario is much more common than what it does test.

all true, but it's pretty easy to rig custom (-f) commands for
virtually any test you want.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-19 Thread Scott Marlowe
On Thu, Nov 19, 2009 at 2:39 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Nov 19, 2009 at 4:10 PM, Greg Smith g...@2ndquadrant.com wrote:
 You can use pgbench to either get interesting peak read results, or peak
 write ones, but it's not real useful for things in between.  The standard
 test basically turns into a huge stack of writes to a single table, and the
 select-only one is interesting to gauge either cached or uncached read speed
 (depending on the scale).  It's not very useful for getting a feel for how
 something with a mixed read/write workload does though, which is unfortunate
 because I think that scenario is much more common than what it does test.

 all true, but it's pretty easy to rig custom (-f) commands for
 virtually any test you want,.

My primary use of pgbench is to exercise a machine as a part of
acceptance testing.  After using it to do power plug pulls, I run it
for a week or two to exercise the drive array and controller mainly.
If a machine runs smoothly for a week with a load factor of 20 or 30,
and the amount of updates that pgbench generates doesn't overwhelm it,
I'm pretty happy.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Kenny Gorman

I found a bit of time to play with this.

I started up a test with 20 concurrent processes all inserting into  
the same table and committing after each insert.  The db was achieving  
about 5000 inserts per second, and I kept it running for about 10  
minutes.  The host was doing about 5MB/s of Physical I/O to the Fusion  
IO drive. I set checkpoint segments very small (10).  I observed the  
following message in the log: checkpoints are occurring too frequently  
(16 seconds apart).  Then I pulled the cord.  On reboot I noticed that  
Fusion IO replayed its log, then the filesystem (vxfs) did the same.  
Then I started up the DB and observed it perform auto-recovery:


Nov 18 14:33:53 frutestdb002 postgres[5667]: [6-1] 2009-11-18 14:33:53  
PSTLOG:  database system was not properly shut down; automatic  
recovery in progress
Nov 18 14:33:53 frutestdb002 postgres[5667]: [7-1] 2009-11-18 14:33:53  
PSTLOG:  redo starts at 2A/55F9D478
Nov 18 14:33:54 frutestdb002 postgres[5667]: [8-1] 2009-11-18 14:33:54  
PSTLOG:  record with zero length at 2A/56692F38
Nov 18 14:33:54 frutestdb002 postgres[5667]: [9-1] 2009-11-18 14:33:54  
PSTLOG:  redo done at 2A/56692F08
Nov 18 14:33:54 frutestdb002 postgres[5667]: [10-1] 2009-11-18  
14:33:54 PSTLOG:  database system is ready


Thanks
Kenny

On Nov 13, 2009, at 1:35 PM, Kenny Gorman wrote:

The FusionIO products are a little different.  They are card based  
vs trying to emulate a traditional disk.  In terms of volatility,  
they have an on-board capacitor that allows power to be supplied  
until all writes drain.  They do not have a cache in front of them  
like a disk-type SSD might.   I don't sell these things, I am just a  
fan.  I verified all this with the Fusion IO techs before I  
replied.  Perhaps older versions didn't have this functionality?  I  
am not sure.  I have already done some cold power off tests w/o  
problems, but I could up the workload a bit and retest.  I will do a  
couple of 'pull the cable' tests on monday or tuesday and report  
back how it goes.


Re the performance #'s...  Here is my post:

http://www.kennygorman.com/wordpress/?p=398

-kg


In order for a drive to work reliably for database use such as for
PostgreSQL, it cannot have a volatile write cache.  You either need a
write cache with a battery backup (and a UPS doesn't count), or to turn
the cache off.  The SSD performance figures you've been looking at are
with the drive's write cache turned on, which means they're completely
fictitious and exaggerated upwards for your purposes.  In the real
world, that will result in database corruption after a crash one day.
No one on the drive benchmarking side of the industry seems to have
picked up on this, so you can't use any of those figures.  I'm not even
sure right now whether drives like Intel's will even meet their lifetime
expectations if they aren't allowed to use their internal volatile write
cache.

Here's two links you should read and then reconsider your whole design:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/
http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html

I can't even imagine how bad the situation would be if you decide to
wander down the use a bunch of really cheap SSD drives path; these
things are barely usable for databases with Intel's hardware.  The needs
of people who want to throw SSD in a laptop and those of the enterprise
database market are really different, and if you believe doom
forecasting like the comments at
http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
that gap is widening, not shrinking.





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey



On 11/13/09 10:21 AM, Karl Denninger k...@denninger.net wrote:

 
 One caution for those thinking of doing this - the incremental
 improvement of this setup on PostGresql in WRITE SIGNIFICANT environment
 isn't NEARLY as impressive.  Indeed the performance in THAT case for
 many workloads may only be 20 or 30% faster than even reasonably
 pedestrian rotating media in a high-performance (lots of spindles and
 thus stripes) configuration and it's more expensive (by a lot.)  If you
 step up to the fast SAS drives on the rotating side there's little
 argument for the SSD at all (again, assuming you don't intend to cheat
 and risk data loss.)

For your database DATA disks, leaving the write cache on is 100% acceptable,
even with power loss, and without a RAID controller.  And even in high write
environments.

That is what the XLOG is for, isn't it?  That is where this behavior is
critical.  But that has completely different performance requirements and
need not be on the same volume, array, or drive.
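For what it's worth, splitting the xlog onto its own device is usually just a
symlink; a rough sketch (paths are placeholders, and the server must be stopped
first):

  pg_ctl -D /var/lib/pgsql/data stop
  mv /var/lib/pgsql/data/pg_xlog /mnt/bbu_array/pg_xlog
  ln -s /mnt/bbu_array/pg_xlog /var/lib/pgsql/data/pg_xlog
  pg_ctl -D /var/lib/pgsql/data start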

 
 Know your application and benchmark it.
 
 -- Karl
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/15/09 12:46 AM, Craig Ringer cr...@postnewspapers.com.au wrote:
 Possible fixes for this are:
 
 - Don't let the drive lie about cache flush operations, ie disable write
 buffering.
 
 - Give Pg some way to find out, from the drive, when particular write
 operations have actually hit disk. AFAIK there's no such mechanism at
 present, and I don't think the drives are even capable of reporting this
 data. If they were, Pg would have to be capable of applying entries from
 the WAL sparsely to account for the way the drive's write cache
 commits changes out-of-order, and Pg would have to maintain a map of
 committed / uncommitted WAL records. Pg would need another map of
 tablespace blocks to WAL records to know, when a drive write cache
 commit notice came in, what record in what WAL archive was affected.
 It'd also require Pg to keep WAL archives for unbounded and possibly
 long periods of time, making disk space management for WAL much harder.
 So - not easy is a bit of an understatement here.

3:  Have PG wait a half second (configurable) after the checkpoint fsync()
completes before deleting/overwriting any WAL segments.  This would be a
trivial feature to add to a postgres release, I think.  Actually, it
already exists!

Turn on log archiving, and have the script that it runs after a checkpoint
sleep().
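Something along these lines in postgresql.conf would do it; the sleep value is
arbitrary, and since a WAL segment cannot be recycled or removed until it has
been archived, the delay effectively postpones reuse (archive_mode is 8.3+,
older releases enable archiving through archive_command alone):

  archive_mode = on
  archive_command = 'cp %p /mnt/archive/%f && sleep 0.5'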

BTW, the information I have seen indicates that the write cache is 256K on
the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
space).

4: Yet another solution:  The drives DO adhere to write barriers properly.
A filesystem that used these in the process of fsync() would be fine too.
So XFS without LVM or MD (or the newer versions of those that don't ignore
barriers) would work too.
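As a rough illustration (device and mount point are examples only): XFS turns
barriers on by default, ext3 has to be asked explicitly, and stacking LVM/MD
underneath could silently drop them on kernels of that era:

  mount -o barrier=1 /dev/sdb1 /pgdata    # ext3: barriers must be requested
  mount -o barrier   /dev/sdb1 /pgdata    # xfs: this is already the default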

So, I think it may not be necessary to turn off write caching for the
non-xlog disks.

 
 You still need to turn off write caching.
 
 --
 Craig Ringer
 
 
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Tom Lane
Scott Carey sc...@richrelevance.com writes:
 For your database DATA disks, leaving the write cache on is 100% acceptable,
 even with power loss, and without a RAID controller.  And even in high write
 environments.

Really?  How hard have you tested that configuration?

 That is what the XLOG is for, isn't it?

Once we have fsync'd a data change, we discard the relevant XLOG
entries.  If the disk hasn't actually put the data on stable storage
before it claims the fsync is done, you're screwed.

XLOG only exists to centralize the writes that have to happen before
a transaction can be reported committed (in particular, to avoid a
lot of random-access writes at commit).  It doesn't make any
fundamental change in the rules of the game: a disk that lies about
write complete will still burn you.

In a zero-seek-cost environment I suspect that XLOG wouldn't actually
be all that useful.  I gather from what's been said earlier that SSDs
don't fully eliminate random-access penalties, though.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/17/09 10:51 AM, Greg Smith g...@2ndquadrant.com wrote:

 Merlin Moncure wrote:
 I am right now talking to someone on postgresql irc who is measuring
 15k iops from x25-e and no data loss following power plug test.
 The funny thing about Murphy is that he doesn't visit when things are
 quiet.  It's quite possible the window for data loss on the drive is
 very small.  Maybe you only see it one out of 10 pulls with a very
 aggressive database-oriented write test.  Whatever the odd conditions
 are, you can be sure you'll see them when there's a bad outage in actual
 production though.

Yes, but there is nothing foolproof.  Murphy visited me recently, and the
RAID card with BBU cache that the WAL logs were on crapped out.  Data was
fine.

Had to fix up the system without any WAL logs.  Luckily, out of 10TB, only
200GB or so of it could have been in the process of being written to (yay!
partitioning by date!), and we could restore just that part rather than
initiating a full restore.
Then there was fun times in single user mode to fix corrupted system tables
(about half the system indexes were dead, and the statistics table was
corrupt, but that could be truncated safely).

Its all fine now with all data validated.

Moral of the story:  Nothing is 100% safe, so sometimes a small bit of KNOWN
risk is perfectly fine.  There is always UNKNOWN risk.  If one risks losing
256K of cached data on an SSD if you're really unlucky with timing, how
dangerous is that versus the chance that the raid card or other hardware
barfs and takes out your whole WAL?

Nothing is safe enough to avoid a full DR plan of action.  The individual
tradeoffs are very application and data dependent.


 
 A good test program that is a bit better at introducing and detecting
 the write cache issue is described at
 http://brad.livejournal.com/2116715.html
 
 --
 Greg Smith    2ndQuadrant   Baltimore, MD
 PostgreSQL Training, Services and Support
 g...@2ndquadrant.com  www.2ndQuadrant.com
 
 
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-18 Thread Scott Carey

On 11/17/09 10:58 PM, da...@lang.hm da...@lang.hm wrote:
 
 keep in mind that bonnie++ isn't always going to reflect your real
 performance.
 
 I have run tests on some workloads that were definitely I/O limited where
 bonnie++ results that differed by a factor of 10x made no measurable
 difference in the application performance, so I can easily believe in
 cases where bonnie++ numbers would not change but application performance
 could be drastically different.
 

Well, that is sort of true for all benchmarks, but I do find that bonnie++
is the worst of the bunch.  I consider it relatively useless compared to
fio.  It's just not a great benchmark for server-type load and I find it
lacking in the ability to simulate real applications.


 as always it can depend heavily on your workload. you really do need to
 figure out how to get your hands on one for your own testing.
 
 David Lang
 
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Merlin Moncure
2009/11/13 Greg Smith g...@2ndquadrant.com:
 As far as what real-world apps have that profile, I like SSDs for small to
 medium web applications that have to be responsive, where the user shows up
 and wants their randomly distributed and uncached data with minimal latency.
 SSDs can also be used effectively as second-tier targeted storage for things
 that have a performance-critical but small and random bit as part of a
 larger design that doesn't have those characteristics; putting indexes on
 SSD can work out well for example (and there the write durability stuff
 isn't quite as critical, as you can always drop an index and rebuild if it
 gets corrupted).

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.  I am
becoming increasingly suspicious that peter's results are not
representative: given that 90% of bonnie++ seeks are read only, the
math doesn't add up, and they contradict broadly published tests on
the internet.  Has anybody independently verified the results?

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Brad Nicholson
On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
 2009/11/13 Greg Smith g...@2ndquadrant.com:
  As far as what real-world apps have that profile, I like SSDs for small to
  medium web applications that have to be responsive, where the user shows up
  and wants their randomly distributed and uncached data with minimal latency.
  SSDs can also be used effectively as second-tier targeted storage for things
  that have a performance-critical but small and random bit as part of a
  larger design that doesn't have those characteristics; putting indexes on
  SSD can work out well for example (and there the write durability stuff
  isn't quite as critical, as you can always drop an index and rebuild if it
  gets corrupted).
 
 I am right now talking to someone on postgresql irc who is measuring
 15k iops from x25-e and no data loss following power plug test.  I am
 becoming increasingly suspicious that peter's results are not
 representative: given that 90% of bonnie++ seeks are read only, the
 math doesn't add up, and they contradict broadly published tests on
 the internet.  Has anybody independently verified the results?

How many times have they run the plug test?  I've read other reports of
people (not on Postgres) losing data on this drive with the write cache
on.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Scott Marlowe
On Tue, Nov 17, 2009 at 9:54 AM, Brad Nicholson
bnich...@ca.afilias.info wrote:
 On Tue, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
 2009/11/13 Greg Smith g...@2ndquadrant.com:
  As far as what real-world apps have that profile, I like SSDs for small to
  medium web applications that have to be responsive, where the user shows up
  and wants their randomly distributed and uncached data with minimal 
  latency.
  SSDs can also be used effectively as second-tier targeted storage for 
  things
  that have a performance-critical but small and random bit as part of a
  larger design that doesn't have those characteristics; putting indexes on
  SSD can work out well for example (and there the write durability stuff
  isn't quite as critical, as you can always drop an index and rebuild if it
  gets corrupted).

 I am right now talking to someone on postgresql irc who is measuring
 15k iops from x25-e and no data loss following power plug test.  I am
 becoming increasingly suspicious that peter's results are not
 representative: given that 90% of bonnie++ seeks are read only, the
 math doesn't add up, and they contradict broadly published tests on
 the internet.  Has anybody independently verified the results?

 How many times have they run the plug test?  I've read other reports of
 people (not on Postgres) losing data on this drive with the write cache
 on.

When I run the plug test it's on a pgbench database that's as big as possible
(scale ~4000), and I remove memory if there's a lot in the server so the
memory is smaller than the db.  I run 100+ concurrent clients, I set
checkpoint timeout to 30 minutes, make a lot of checkpoint
segments (100 or so), and set completion target to 0.  Then after
about 1/2 the checkpoint timeout has passed, I issue a checkpoint from the
command line, take a deep breath and pull the cord.
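For reference, a sketch of that setup (the exact numbers are whatever fits the
hardware under test, and pgbench -T assumes 8.4 or later):

  # postgresql.conf
  checkpoint_timeout = 30min
  checkpoint_segments = 100
  checkpoint_completion_target = 0

  pgbench -i -s 4000 bench         # scale chosen so the db is bigger than RAM
  pgbench -c 100 -T 3600 bench &   # 100+ clients writing hard
  sleep 900                        # roughly half the checkpoint timeout
  psql -c "CHECKPOINT;" bench      # then yank the power mid-checkpoint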

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Peter Eisentraut
On tis, 2009-11-17 at 11:36 -0500, Merlin Moncure wrote:
 I am right now talking to someone on postgresql irc who is measuring
 15k iops from x25-e and no data loss following power plug test.  I am
 becoming increasingly suspicious that peter's results are not
 representative: given that 90% of bonnie++ seeks are read only, the
 math doesn't add up, and they contradict broadly published tests on
 the internet.  Has anybody independently verified the results?

Notably, between my two blog posts and this email thread, there have
been claims of

400
1800
4000
7000
14000
15000
35000

iops (of some kind).

That alone should be cause for concern.


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Greg Smith

Merlin Moncure wrote:

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.
The funny thing about Murphy is that he doesn't visit when things are 
quiet.  It's quite possible the window for data loss on the drive is 
very small.  Maybe you only see it one out of 10 pulls with a very 
aggressive database-oriented write test.  Whatever the odd conditions 
are, you can be sure you'll see them when there's a bad outage in actual 
production though.


A good test program that is a bit better at introducing and detecting 
the write cache issue is described at 
http://brad.livejournal.com/2116715.html
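From memory the usage is roughly as below -- the listener runs on a second
machine that stays up while you cut power to the box under test; check the
page above for the exact syntax:

  ./diskchecker.pl -l                              # on the helper machine
  ./diskchecker.pl -s helper create test_file 500  # on the machine to be unplugged
  # ...pull the plug, reboot, then:
  ./diskchecker.pl -s helper verify test_file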


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Merlin Moncure
On Tue, Nov 17, 2009 at 1:51 PM, Greg Smith g...@2ndquadrant.com wrote:
 Merlin Moncure wrote:

 I am right now talking to someone on postgresql irc who is measuring
 15k iops from x25-e and no data loss following power plug test.

 The funny thing about Murphy is that he doesn't visit when things are quiet.
  It's quite possible the window for data loss on the drive is very small.
  Maybe you only see it one out of 10 pulls with a very aggressive
 database-oriented write test.  Whatever the odd conditions are, you can be
 sure you'll see them when there's a bad outage in actual production though.

 A good test program that is a bit better at introducing and detecting the
 write cache issue is described at http://brad.livejournal.com/2116715.html

Sure, not disputing that...I don't have one to test myself, so I can't
vouch for the data being safe.  But what's up with the 400 iops
measured from bonnie++?  That's an order of magnitude slower than any
other published benchmark on the 'net, and I'm dying to get a little
clarification here.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Mark Mielke

On 11/17/2009 01:51 PM, Greg Smith wrote:

Merlin Moncure wrote:

I am right now talking to someone on postgresql irc who is measuring
15k iops from x25-e and no data loss following power plug test.
The funny thing about Murphy is that he doesn't visit when things are 
quiet.  It's quite possible the window for data loss on the drive is 
very small.  Maybe you only see it one out of 10 pulls with a very 
aggressive database-oriented write test.  Whatever the odd conditions 
are, you can be sure you'll see them when there's a bad outage in 
actual production though.


A good test program that is a bit better at introducing and detecting 
the write cache issue is described at 
http://brad.livejournal.com/2116715.html




I've been following this thread with great interest in your results... 
Please continue to share...


For write cache issues - is it possible that the reduced power 
utilization of SSD allows for a capacitor to complete all scheduled 
writes, even with a large cache? Is it this particular drive you are 
suggesting that is known to be insufficient or is it really the 
technology or maturity of the technology?


Cheers,
mark

--
Mark Mielke    m...@mielke.cc


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread Greg Smith

Merlin Moncure wrote:
But what's up with the 400 iops measured from bonnie++?  
I don't know really.  SSD writes are really sensitive to block size and 
the ability to chunk writes into larger chunks, so it may be that Peter 
has just found the worst-case behavior and everybody else is seeing 
something better than that.


When the reports I get back from people I believe are competent--Vadim,
Peter--show worst-case results that are lucky to beat RAID10, I feel I 
have to dismiss the higher values reported by people who haven't been so 
careful.  And that's just about everybody else, which leaves me quite 
suspicious of the true value of the drives.  The whole thing really sets 
off my vendor hype reflex, and short of someone loaning me a drive to 
test I'm not sure how to get past that.  The Intel drives are still just 
a bit too expensive to buy one on a whim, such that I'll just toss it if 
the drive doesn't live up to expectations.


--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-17 Thread david

On Wed, 18 Nov 2009, Greg Smith wrote:


Merlin Moncure wrote:
But what's up with the 400 iops measured from bonnie++? 
I don't know really.  SSD writes are really sensitive to block size and the 
ability to chunk writes into larger chunks, so it may be that Peter has just 
found the worst-case behavior and everybody else is seeing something better 
than that.


When the reports I get back from people I believe are competent--Vadim,
Peter--show worst-case results that are lucky to beat RAID10, I feel I have 
to dismiss the higher values reported by people who haven't been so careful. 
And that's just about everybody else, which leaves me quite suspicious of the 
true value of the drives.  The whole thing really sets off my vendor hype 
reflex, and short of someone loaning me a drive to test I'm not sure how to 
get past that.  The Intel drives are still just a bit too expensive to buy 
one on a whim, such that I'll just toss it if the drive doesn't live up to 
expectations.


keep in mind that bonnie++ isn't always going to reflect your real 
performance.


I have run tests on some workloads that were definitely I/O limited where
bonnie++ results that differed by a factor of 10x made no measurable 
difference in the application performance, so I can easily believe in 
cases where bonnie++ numbers would not change but application performance 
could be drastically different.


as always it can depend heavily on your workload. you really do need to 
figure out how to get your hands on one for your own testing.


David Lang

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig Ringer
On 15/11/2009 11:57 AM, Laszlo Nagy wrote:

 Ok, I'm getting confused here. There is the WAL, which is written
 sequentially. If the WAL is not corrupted, then it can be replayed on
 next database startup. Please somebody enlighten me! In my mind, fsync
 is only needed for the WAL. If I could configure postgresql to put the
 WAL on a real hard drive that has BBU and write cache, then I cannot
 lose data. Meanwhile, product table data could be placed on the SSD
 drive, and I should be able to turn on write cache safely. Am I wrong?

A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.

The machine loses power.

Oops! A hole just appeared in history. A WAL replay won't re-apply the
changes that the database guaranteed had hit disk, but the changes never
made it onto the main database storage.

Possible fixes for this are:

- Don't let the drive lie about cache flush operations, ie disable write
buffering.

- Give Pg some way to find out, from the drive, when particular write
operations have actually hit disk. AFAIK there's no such mechanism at
present, and I don't think the drives are even capable of reporting this
data. If they were, Pg would have to be capable of applying entries from
the WAL sparsely to account for the way the drive's write cache
commits changes out-of-order, and Pg would have to maintain a map of
committed / uncommitted WAL records. Pg would need another map of
tablespace blocks to WAL records to know, when a drive write cache
commit notice came in, what record in what WAL archive was affected.
It'd also require Pg to keep WAL archives for unbounded and possibly
long periods of time, making disk space management for WAL much harder.
So - not easy is a bit of an understatement here.

You still need to turn off write caching.
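On Linux that usually comes down to something like the following (device names
are examples only; whether the setting survives a power cycle varies by drive):

  hdparm -W 0 /dev/sda            # SATA/ATA: disable the volatile write cache
  sdparm --clear=WCE /dev/sda     # SCSI/SAS equivalent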

--
Craig Ringer


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Laszlo Nagy



A change has been written to the WAL and fsync()'d, so Pg knows it's hit
disk. It can now safely apply the change to the tables themselves, and
does so, calling fsync() to tell the drive containing the tables to
commit those changes to disk.

The drive lies, returning success for the fsync when it's just cached
the data in volatile memory. Pg carries on, shortly deleting the WAL
archive the changes were recorded in or recycling it and overwriting it
with new change data. The SSD is still merrily buffering data to write
cache, and hasn't got around to writing your particular change yet.
  
All right. I believe you. In the current Pg implementation, I need to
turn off the disk cache.


But I would like to ask some theoretical questions. It is just an 
idea from me, and probably I'm wrong.

Here is a scenario:

#1. user wants to change something, resulting in a write_to_disk(data) call
#2. data is written into the WAL and fsync()-ed
#3. at this point the write_to_disk(data) call CAN RETURN, the user can 
continue his work (the WAL is already written, changes cannot be lost)

#4. Pg can continue writing data onto the disk, and fsync() it.
#5. Then WAL archive data can be deleted.

Now maybe I'm wrong, but between #3 and #5, the data to be written is 
kept in memory. This is basically a write cache, implemented in OS 
memory. We could really handle it like a write cache. E.g. everything 
would remain the same, except that we add some latency. We can wait some 
time after the last modification of a given block, and then write it out.


Is it possible to do? If so, then we can turn off the write cache for
all drives, except the one holding the WAL, and write speed would still
remain the same. I don't think that any SSD drive has more than some
megabytes of write cache. The same amount of write cache could easily be
implemented in OS memory, and then Pg would always know what hit the disk.


Thanks,

  Laci


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig Ringer
On 15/11/2009 2:05 PM, Laszlo Nagy wrote:
 
 A change has been written to the WAL and fsync()'d, so Pg knows it's hit
 disk. It can now safely apply the change to the tables themselves, and
 does so, calling fsync() to tell the drive containing the tables to
 commit those changes to disk.

 The drive lies, returning success for the fsync when it's just cached
 the data in volatile memory. Pg carries on, shortly deleting the WAL
 archive the changes were recorded in or recycling it and overwriting it
 with new change data. The SSD is still merrily buffering data to write
 cache, and hasn't got around to writing your particular change yet.
   
 All right. I believe you. In the current Pg implementation, I need to
 turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

 #1. user wants to change something, resulting in a write_to_disk(data) call
 #2. data is written into the WAL and fsync()-ed
 #3. at this point the write_to_disk(data) call CAN RETURN, the user can
 continue his work (the WAL is already written, changes cannot be lost)
 #4. Pg can continue writing data onto the disk, and fsync() it.
 #5. Then WAL archive data can be deleted.
 
 Now maybe I'm wrong, but between #3 and #5, the data to be written is
 kept in memory. This is basically a write cache, implemented in OS
 memory. We could really handle it like a write cache. E.g. everything
 would remain the same, except that we add some latency. We can wait some
 time after the last modification of a given block, and then write it out.

I don't know enough about the whole affair to give you a good
explanation ( I tried, and it just showed me how much I didn't know )
but here are a few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD really flush your cache
now after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some
changes in the WAL would hit main disk before others that were earlier
in the WAL. However, I don't think that matters if full_page_writes are
on. If you replay from the start, you'll reapply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand,
the INCREDIBLY long WAL replay times during recovery would be a nightmare.

 I don't think that any SSD drive has more than some
 megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.

 The same amount of write cache could easily be
 implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know what order the SSD writes things out from its
cache?

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Laszlo Nagy



- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's can.
  

Okay, I see. We cannot query erase block size from an SSD drive. :-(

I don't think that any SSD drive has more than some
megabytes of write cache.



The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.
  

Heh, this is why they are so expensive. :-)

The same amount of write cache could easily be
implemented in OS memory, and then Pg would always know what hit the disk.



Really? How does Pg know what order the SSD writes things out from its
cache?
  
I got the point. We cannot implement an efficient write cache without 
much more knowledge about how that particular drive works.


So... the only solution that works well is to have much more RAM for 
read cache, and much more RAM for write cache inside the RAID controller 
(with BBU).


Thank you,

  Laszlo


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Craig James

I've wondered whether this would work for a read-mostly application: Buy a big 
RAM machine, like 64GB, with a crappy little single disk.  Build the database, 
then make a really big RAM disk, big enough to hold the DB and the WAL.  Then 
build a duplicate DB on another machine with a decent disk (maybe a 4-disk 
RAID10), and turn on WAL logging.

The system would be blazingly fast, and you'd just have to be sure before you 
shut it off to shut down Postgres and copy the RAM files back to the regular 
disk.  And if you didn't, you could always recover from the backup.  Since it's 
a read-mostly system, the WAL logging bandwidth wouldn't be too high, so even a 
modest machine would be able to keep up.

Any thoughts?

Craig

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-15 Thread Heikki Linnakangas
Craig James wrote:
 I've wondered whether this would work for a read-mostly application: Buy
 a big RAM machine, like 64GB, with a crappy little single disk.  Build
 the database, then make a really big RAM disk, big enough to hold the DB
 and the WAL.  Then build a duplicate DB on another machine with a decent
 disk (maybe a 4-disk RAID10), and turn on WAL logging.
 
 The system would be blazingly fast, and you'd just have to be sure
 before you shut it off to shut down Postgres and copy the RAM files back
 to the regular disk.  And if you didn't, you could always recover from
 the backup.  Since it's a read-mostly system, the WAL logging bandwidth
 wouldn't be too high, so even a modest machine would be able to keep up.

Should work, but I don't see any advantage over attaching the RAID array
directly to the 1st machine with the RAM and turning synchronous_commit=off.
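For reference, that's a one-line change on 8.3 and later, either cluster-wide
or per session:

  # postgresql.conf
  synchronous_commit = off

  -- or per session:
  SET synchronous_commit = off;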

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Lists

Laszlo Nagy wrote:

Hello,

I'm about to buy SSD drive(s) for a database. For decision making, I 
used this tech report:


http://techreport.com/articles.x/16255/9
http://techreport.com/articles.x/16255/10

Here are my concerns:

   * I need at least 32GB disk space. So DRAM based SSD is not a real
 option. I would have to buy 8x4GB memory, costs a fortune. And
 then it would still not have redundancy.
   * I could buy two X25-E drives and have 32GB disk space, and some
 redundancy. This would cost about $1600, not counting the RAID
 controller. It is on the edge.
This was the solution I went with (4 drives in a raid 10 actually). Not 
a cheap solution, but the performance is amazing.



   * I could also buy many cheaper MLC SSD drives. They cost about
 $140. So even with 10 drives, I'm at $1400. I could put them in
 RAID6, have much more disk space (256GB), high redundancy and
 POSSIBLY good read/write speed. Of course then I need to buy a
 good RAID controller.

My question is about the last option. Are there any good RAID cards 
that are optimized (or can be optimized) for SSD drives? Do any of you 
have experience in using many cheaper SSD drives? Is it a bad idea?


Thank you,

  Laszlo





--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Ivan Voras

Lists wrote:

Laszlo Nagy wrote:

Hello,

I'm about to buy SSD drive(s) for a database. For decision making, I 
used this tech report:


http://techreport.com/articles.x/16255/9
http://techreport.com/articles.x/16255/10

Here are my concerns:

   * I need at least 32GB disk space. So DRAM based SSD is not a real
 option. I would have to buy 8x4GB memory, costs a fortune. And
 then it would still not have redundancy.
   * I could buy two X25-E drives and have 32GB disk space, and some
 redundancy. This would cost about $1600, not counting the RAID
 controller. It is on the edge.
This was the solution I went with (4 drives in a raid 10 actually). Not 
a cheap solution, but the performance is amazing.


I've came across this article:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

It's from a Linux MySQL user so it's a bit confusing but it looks like 
he has some reservations about performance vs reliability of the Intel 
drives - apparently they have their own write cache and when it's 
disabled performance drops sharply.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Heikki Linnakangas
Merlin Moncure wrote:
 2009/11/13 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Laszlo Nagy wrote:
* I need at least 32GB disk space. So DRAM based SSD is not a real
  option. I would have to buy 8x4GB memory, costs a fortune. And
  then it would still not have redundancy.
 At 32GB database size, I'd seriously consider just buying a server with
 a regular hard drive or a small RAID array for redundancy, and stuffing
 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
 and tested technology.
 
 lots of ram doesn't help you if:
 *) your database gets written to a lot and you have high performance
 requirements

When all the (hot) data is cached, all writes are sequential writes to
the WAL, with the occasional flushing of the data pages at checkpoint.
The sequential write bandwidth of SSDs and HDDs is roughly the same.

I presume the fsync latency is a lot higher with HDDs, so if you're
running a lot of small write transactions, and don't want to risk losing
any recently committed transactions by setting synchronous_commit=off,
the usual solution is to get a RAID controller with a battery-backed up
cache. With a BBU cache, the fsync latency should be in the same
ballpark as with SSDs.

 *) your data is important

Huh? The data is safely on the hard disk in case of a crash. The RAM is
just for caching.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Merlin Moncure
On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 lots of ram doesn't help you if:
 *) your database gets written to a lot and you have high performance
 requirements

 When all the (hot) data is cached, all writes are sequential writes to
 the WAL, with the occasional flushing of the data pages at checkpoint.
 The sequential write bandwidth of SSDs and HDDs is roughly the same.

 I presume the fsync latency is a lot higher with HDDs, so if you're
 running a lot of small write transactions, and don't want to risk losing
 any recently committed transactions by setting synchronous_commit=off,
 the usual solution is to get a RAID controller with a battery-backed up
 cache. With a BBU cache, the fsync latency should be in the same
 ballpark as with SSDs.

BBU raid controllers might only give better burst performance.  If you
are writing data randomly all over the volume, the cache will overflow
and performance will degrade.  Raid controllers degrade in different
fashions, at least one (perc 5) halted ALL access to the volume and
spun out the cache (a bug, IMO).

 *) your data is important

 Huh? The data is safely on the hard disk in case of a crash. The RAM is
 just for caching.

I was alluding to not being able to lose any transactions... in this
case you can only run fsync synchronously.  You are then bound by the
write capabilities of the volume; RAM only buffers reads.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Heikki Linnakangas
Merlin Moncure wrote:
 On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 lots of ram doesn't help you if:
 *) your database gets written to a lot and you have high performance
 requirements
 When all the (hot) data is cached, all writes are sequential writes to
 the WAL, with the occasional flushing of the data pages at checkpoint.
 The sequential write bandwidth of SSDs and HDDs is roughly the same.

 I presume the fsync latency is a lot higher with HDDs, so if you're
 running a lot of small write transactions, and don't want to risk losing
 any recently committed transactions by setting synchronous_commit=off,
 the usual solution is to get a RAID controller with a battery-backed up
 cache. With a BBU cache, the fsync latency should be in the same
 ballpark as with SSDs.
 
 BBU raid controllers might only give better burst performance.  If you
 are writing data randomly all over the volume, the cache will overflow
 and performance will degrade.

We're discussing a scenario where all the data fits in RAM. That's what
the large amount of RAM is for. The only thing that's being written to
disk is the WAL, which is sequential, and the occasional flush of data
pages from the buffer cache at checkpoints, which doesn't happen often
and will be spread over a period of time.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy

Heikki Linnakangas wrote:

Laszlo Nagy wrote:
  

   * I need at least 32GB disk space. So DRAM based SSD is not a real
 option. I would have to buy 8x4GB memory, costs a fortune. And
 then it would still not have redundancy.



At 32GB database size, I'd seriously consider just buying a server with
a regular hard drive or a small RAID array for redundancy, and stuffing
16 or 32 GB of RAM into it to ensure everything is cached. That's tried
and tested technology.
  
32GB is for one table only. This server runs other applications, and you 
need to leave space for sort memory, shared buffers etc. Buying 128GB 
memory would solve the problem, maybe... but it is too expensive. And it 
is not safe. Power out - data loss.

I don't know how you came to the 32 GB figure, but keep in mind that
administration is a lot easier if you have plenty of extra disk space
for things like backups, dumps+restore, temporary files, upgrades etc.
  
This disk space would be dedicated for a smaller tablespace, holding one 
or two bigger tables with index scans. Of course I would never use an 
SSD disk for storing database backups. It would be waste of money.



 L


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Robert Haas
2009/11/14 Laszlo Nagy gand...@shopzeus.com:
 32GB is for one table only. This server runs other applications, and you
 need to leave space for sort memory, shared buffers etc. Buying 128GB memory
 would solve the problem, maybe... but it is too expensive. And it is not
 safe. Power out - data loss.

Huh?

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Merlin Moncure
On Sat, Nov 14, 2009 at 8:47 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Merlin Moncure wrote:
 On Sat, Nov 14, 2009 at 6:17 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 lots of ram doesn't help you if:
 *) your database gets written to a lot and you have high performance
 requirements
 When all the (hot) data is cached, all writes are sequential writes to
 the WAL, with the occasional flushing of the data pages at checkpoint.
 The sequential write bandwidth of SSDs and HDDs is roughly the same.

 I presume the fsync latency is a lot higher with HDDs, so if you're
 running a lot of small write transactions, and don't want to risk losing
 any recently committed transactions by setting synchronous_commit=off,
 the usual solution is to get a RAID controller with a battery-backed up
 cache. With a BBU cache, the fsync latency should be in the same
 ballpark as with SSDs.

 BBU raid controllers might only give better burst performance.  If you
 are writing data randomly all over the volume, the cache will overflow
 and performance will degrade.

 We're discussing a scenario where all the data fits in RAM. That's what
 the large amount of RAM is for. The only thing that's being written to
 disk is the WAL, which is sequential, and the occasional flush of data
 pages from the buffer cache at checkpoints, which doesn't happen often
 and will be spread over a period of time.

We are basically in agreement, but regardless of the effectiveness of
your WAL implementation, raid controller, etc., if you have to write
data to what approximates random locations on a disk-based volume in a
sustained manner, you must eventually degrade to whatever the drive
can handle, plus whatever efficiency the checkpointer and O/S can gain by
grouping writes together.  Extra RAM mainly helps because it can
shave precious iops off the read side so you can use them for writing.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy

Robert Haas wrote:

2009/11/14 Laszlo Nagy gand...@shopzeus.com:
  

32GB is for one table only. This server runs other applications, and you
need to leave space for sort memory, shared buffers etc. Buying 128GB memory
would solve the problem, maybe... but it is too expensive. And it is not
safe. Power out - data loss.

I'm sorry, I thought he was talking about keeping the database in memory
with fsync=off. Now I see he was only talking about the OS disk cache.


My server has 24GB RAM, and I cannot easily expand it unless I throw out 
some 2GB modules, and buy more 4GB or 8GB modules. But... buying 4x8GB 
ECC RAM (+throwing out 4x2GB RAM) is a lot more expensive than buying 
some 64GB SSD drives. 95% of the table in question is not modified. Only 
read (mostly with index scan). Only 5% is actively updated.


This is why I think using an SSD in my case would be effective.

Sorry for the confusion.

 L


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-14 Thread Laszlo Nagy




   * I could buy two X25-E drives and have 32GB disk space, and some
 redundancy. This would cost about $1600, not counting the RAID
 controller. It is on the edge.
This was the solution I went with (4 drives in a raid 10 actually). 
Not a cheap solution, but the performance is amazing.


I've came across this article:

http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/ 



It's from a Linux MySQL user so it's a bit confusing but it looks like 
he has some reservations about performance vs reliability of the Intel 
drives - apparently they have their own write cache and when it's 
disabled performance drops sharply.
Ok, I'm getting confused here. There is the WAL, which is written 
sequentially. If the WAL is not corrupted, then it can be replayed on 
next database startup. Please somebody enlighten me! In my mind, fsync 
is only needed for the WAL. If I could configure postgresql to put the 
WAL on a real hard drive that has BBU and write cache, then I cannot 
lose data. Meanwhile, product table data could be placed on the SSD
drive, and I should be able to turn on write cache safely. Am I wrong?


 L


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Laszlo Nagy



Note that some RAID controllers (3Ware in particular) refuse to
recognize the MLC drives, in particular, they act as if the OCZ Vertex
series do not exist when connected.

I don't know what they're looking for (perhaps some indication that
actual rotation is happening?) but this is a potential problem -- make
sure your adapter can talk to these things!

BTW I have done some benchmarking with Postgresql against these drives
and they are SMOKING fast.
  
I was thinking about ARECA 1320 with 2GB memory + BBU. Unfortunately, I 
cannot find information about using ARECA cards with SSD drives. I'm 
also not sure how they would work together. I guess the RAID cards are 
optimized for conventional disks. They read/write data in bigger blocks 
and they optimize the order of reading/writing for physical cylinders. I 
know for sure that this particular areca card has an Intel dual core IO 
processor and its own embedded operating system. I guess it could be 
tuned for SSD drives, but I don't know how.


I was hoping that with a RAID 6 setup, write speed (which is slower for
cheaper flash based SSD drives) would dramatically increase, because
information is written simultaneously to 10 drives. With a very small block
size, it would probably be true. But... what if the RAID card uses
bigger block sizes, and - say - I want to update much smaller blocks in
the database?


My other option is to buy two SLC SSD drives and use RAID1. It would 
cost about the same, but has less redundancy and less capacity. Which is
faster? 8-10 MLC disks in RAID 6 with a good caching controller, or
two SLC disks in RAID1?


Thanks,

  Laszlo


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Marcos Ortiz Valmaseda

This is very fast.
On IT Toolbox there are many whitepapers about it,
specifically in the ERP and DataCenter sections.

All the tests that we do should be shared on the Project Wiki.

Regards

On Nov 13, 2009, at 7:02 AM, Karl Denninger wrote:


Laszlo Nagy wrote:

Hello,

I'm about to buy SSD drive(s) for a database. For decision making, I
used this tech report:

http://techreport.com/articles.x/16255/9
http://techreport.com/articles.x/16255/10

Here are my concerns:

  * I need at least 32GB disk space. So DRAM based SSD is not a real
option. I would have to buy 8x4GB memory, costs a fortune. And
then it would still not have redundancy.
  * I could buy two X25-E drives and have 32GB disk space, and some
redundancy. This would cost about $1600, not counting the RAID
controller. It is on the edge.
  * I could also buy many cheaper MLC SSD drives. They cost about
$140. So even with 10 drives, I'm at $1400. I could put them in
RAID6, have much more disk space (256GB), high redundancy and
POSSIBLY good read/write speed. Of course then I need to buy a
good RAID controller.

My question is about the last option. Are there any good RAID cards
that are optimized (or can be optimized) for SSD drives? Do any of you
have experience in using many cheaper SSD drives? Is it a bad idea?

Thank you,

 Laszlo


Note that some RAID controllers (3Ware in particular) refuse to
recognize the MLC drives, in particular, they act as if the OCZ Vertex
series do not exist when connected.

I don't know what they're looking for (perhaps some indication that
actual rotation is happening?) but this is a potential problem -- make
sure your adapter can talk to these things!

BTW I have done some benchmarking with Postgresql against these drives
and they are SMOKING fast.

-- Karl
karl.vcf
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Scott Marlowe
2009/11/13 Laszlo Nagy gand...@shopzeus.com:
 Hello,

 I'm about to buy SSD drive(s) for a database. For decision making, I used
 this tech report:

 http://techreport.com/articles.x/16255/9
 http://techreport.com/articles.x/16255/10

 Here are my concerns:

   * I need at least 32GB disk space. So DRAM based SSD is not a real
     option. I would have to buy 8x4GB memory, costs a fortune. And
     then it would still not have redundancy.
   * I could buy two X25-E drives and have 32GB disk space, and some
     redundancy. This would cost about $1600, not counting the RAID
     controller. It is on the edge.

I'm not sure a RAID controller brings much of anything to the table with SSDs.

   * I could also buy many cheaper MLC SSD drives. They cost about
     $140. So even with 10 drives, I'm at $1400. I could put them in
     RAID6, have much more disk space (256GB), high redundancy and

I think RAID6 is gonna reduce the throughput due to overhead to
something far less than what a software RAID-10 would achieve.

     POSSIBLY good read/write speed. Of course then I need to buy a
     good RAID controller.

I'm guessing that if you spent whatever money you were gonna spend on
more SSDs you'd come out ahead, assuming you had somewhere to put
them.

 My question is about the last option. Are there any good RAID cards that are
 optimized (or can be optimized) for SSD drives? Do any of you have
 experience in using many cheaper SSD drives? Is it a bad idea?

This I don't know.  Some quick googling shows the Areca 1680ix and
Adaptec 5 Series to be able to handle Samsung SSDs.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Merlin Moncure
On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe scott.marl...@gmail.com wrote:
 I think RAID6 is gonna reduce the throughput due to overhead to
 something far less than what a software RAID-10 would achieve.

I was wondering about this.  I think raid 5/6 might be a better fit
for SSD than traditional drive arrays.  Here's my thinking:

*) flash SSD reads are cheaper than writes.  With 6 or more drives,
less total data has to be written in Raid 5 than Raid 10.  The main
component of the raid 5 performance penalty is that each written
block has to be read first and then written...incurring rotational
latency, etc.   SSD does not have this problem.

*) flash is much more expensive in terms of storage/$.

*) flash (at least the intel stuff) is so fast relative to what we are
used to, that the point of using flash in raid is more for fault
tolerance than performance enhancement.  I don't have data to support
this, but I suspect that even with relatively small amount of the
slower MLC drives in raid, postgres will become cpu bound for most
applications.

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Heikki Linnakangas
Laszlo Nagy wrote:
* I need at least 32GB disk space. So DRAM based SSD is not a real
  option. I would have to buy 8x4GB memory, costs a fortune. And
  then it would still not have redundancy.

At 32GB database size, I'd seriously consider just buying a server with
a regular hard drive or a small RAID array for redundancy, and stuffing
16 or 32 GB of RAM into it to ensure everything is cached. That's tried
and tested technology.

I don't know how you came to the 32 GB figure, but keep in mind that
administration is a lot easier if you have plenty of extra disk space
for things like backups, dumps+restore, temporary files, upgrades etc.
So if you think you'd need 32 GB of disk space, I'm guessing that 16 GB
of RAM would be enough to hold all the hot data in cache. And if you
choose a server with enough DIMM slots, you can expand easily if needed.

Just my 2 cents, I'm not really an expert on hardware..

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Merlin Moncure
2009/11/13 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Laszlo Nagy wrote:
    * I need at least 32GB disk space. So DRAM based SSD is not a real
      option. I would have to buy 8x4GB memory, costs a fortune. And
      then it would still not have redundancy.

 At 32GB database size, I'd seriously consider just buying a server with
 a regular hard drive or a small RAID array for redundancy, and stuffing
 16 or 32 GB of RAM into it to ensure everything is cached. That's tried
 and tested technology.

lots of ram doesn't help you if:
*) your database gets written to a lot and you have high performance
requirements
*) your data is important

(if either of the above is not true or even partially true, then your
advice is spot on)

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Scott Carey



On 11/13/09 7:29 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe scott.marl...@gmail.com
 wrote:
 I think RAID6 is gonna reduce the throughput due to overhead to
 something far less than what a software RAID-10 would achieve.
 
 I was wondering about this.  I think raid 5/6 might be a better fit
 for SSD than traditional drives arrays.  Here's my thinking:
 
 *) flash SSD reads are cheaper than writes.  With 6 or more drives,
 less total data has to be written in Raid 5 than Raid 10.  The main
 component of raid 5 performance penalty is that for each written
 block, it has to be read first than written...incurring rotational
 latency, etc.   SSD does not have this problem.
 

For random writes, RAID 5 writes as much as RAID 10 (parity + data), and
more if the raid block size is larger than 8k.  With RAID 6 it writes 50%
more than RAID 10.

For streaming writes RAID 5 / 6 has an advantage however.

For SLC drives, there is really not much of a write performance penalty.
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Karl Denninger

Greg Smith wrote:
 In order for a drive to work reliably for database use such as for
 PostgreSQL, it cannot have a volatile write cache.  You either need a
 write cache with a battery backup (and a UPS doesn't count), or to
 turn the cache off.  The SSD performance figures you've been looking
 at are with the drive's write cache turned on, which means they're
 completely fictitious and exaggerated upwards for your purposes.  In
 the real world, that will result in database corruption after a crash
 one day.
If power is unexpectedly removed from the system, this is true.  But
the caches on the SSD controllers are BUFFERS.  An operating system
crash does not disrupt the data in them or cause corruption.  An
unexpected disconnection of the power source from the drive (due to
unplugging it or a power supply failure for whatever reason) is a
different matter.
   No one on the drive benchmarking side of the industry seems to have
 picked up on this, so you can't use any of those figures.  I'm not
 even sure right now whether drives like Intel's will even meet their
 lifetime expectations if they aren't allowed to use their internal
 volatile write cache.

 Here's two links you should read and then reconsider your whole design:
 http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/

 http://petereisentraut.blogspot.com/2009/07/solid-state-drive-benchmarks-and-write.html


 I can't even imagine how bad the situation would be if you decide to
 wander down the use a bunch of really cheap SSD drives path; these
 things are barely usable for databases with Intel's hardware.  The
 needs of people who want to throw SSD in a laptop and those of the
 enterprise database market are really different, and if you believe
 doom forecasting like the comments at
 http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc
 that gap is widening, not shrinking.
Again, it depends.

With the write cache off on these disks they still are huge wins for
very-heavy-read applications, which many are.  The issue is (as always)
operation mix - if you do a lot of inserts and updates then you suffer,
but a lot of database applications are in the high 90%+ SELECTs both in
frequency and data flow volume.  The lack of rotational and seek latency
in those applications is HUGE.

-- Karl Denninger
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Greg Smith

Karl Denninger wrote:

If power is unexpectedly removed from the system, this is true.  But
the caches on the SSD controllers are BUFFERS.  An operating system
crash does not disrupt the data in them or cause corruption.  An
unexpected disconnection of the power source from the drive (due to
unplugging it or a power supply failure for whatever reason) is a
different matter.
  
As standard operating procedure, I regularly get something writing heavily 
to the database on hardware I'm suspicious of and power the box off 
hard.  If at any time I suffer database corruption from this, the 
hardware is unsuitable for database use; that should never happen.  This 
is what I mean when I say something meets the mythical "enterprise" 
quality.  Companies whose data is worth something can't operate in a 
situation where money has been exchanged because a database commit was 
recorded, only to lose that commit just because somebody tripped over 
the power cord and it was in the buffer rather than on permanent disk.  
That's just not acceptable, and the even bigger danger of the database 
perhaps not coming up altogether even after such a tiny disaster is also 
very real with a volatile write cache.
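
A bare-bones version of that plug test can be scripted in a few lines, in
the spirit of diskchecker.pl (this is only a sketch; the file name and
record format are made up).  Run the writer, pull the plug mid-stream, then
after reboot compare the record count against the last sequence number that
was printed before power was lost:

import os, struct, sys

PATH = "plugtest.dat"                    # hypothetical file name
REC = struct.Struct("<Q")                # one 8-byte sequence number per record

def writer():
    # append fsync'd sequence numbers until the plug is pulled
    seq = 0
    with open(PATH, "ab") as f:
        while True:
            f.write(REC.pack(seq))
            f.flush()
            os.fsync(f.fileno())         # the write is claimed durable once this returns
            print(seq)                   # the last number printed must survive the crash
            seq += 1

def verifier():
    # after reboot: every acknowledged record must be present, in order
    data = open(PATH, "rb").read()
    count = len(data) // REC.size        # a torn final record is ignored
    nums = [REC.unpack_from(data, i * REC.size)[0] for i in range(count)]
    if nums == list(range(count)):
        print("found %d records, sequence intact" % count)
    else:
        print("gap or reordering found - acknowledged writes were lost")

if __name__ == "__main__":
    verifier() if "verify" in sys.argv[1:] else writer()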



With the write cache off on these disks they still are huge wins for
very-heavy-read applications, which many are.
Very read-heavy applications would do better to buy a ton of RAM instead 
and just make sure they populate from permanent media (say by reading 
everything in early at sequential rates to prime the cache).  There is 
an extremely narrow use-case where SSDs are the right technology, and 
it's only in a subset even of read-heavy apps where they make sense.
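
A rough sketch of that priming step, assuming a hypothetical data directory
and chunk size (plain cat or tar to /dev/null accomplishes the same thing),
would just walk the data directory at startup and read every file
sequentially so it lands in the OS page cache:

import os

DATA_DIR = "/var/lib/pgsql/data/base"    # hypothetical data directory
CHUNK = 8 * 1024 * 1024                  # 8 MB sequential reads

def prewarm(root):
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(CHUNK): # data lands in the OS page cache
                        pass
                total += os.path.getsize(path)
            except OSError:
                pass                     # skip anything unreadable
    return total

print("read %.1f GB from %s" % (prewarm(DATA_DIR) / 1e9, DATA_DIR))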


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Karl Denninger
Greg Smith wrote:
 Karl Denninger wrote:
 If power is unexpectedly removed from the system, this is true.  But
 the caches on the SSD controllers are BUFFERS.  An operating system
 crash does not disrupt the data in them or cause corruption.  An
 unexpected disconnection of the power source from the drive (due to
 unplugging it or a power supply failure for whatever reason) is a
 different matter.
   
 As standard operating procedure, I regularly get something writing
 heavy to the database on hardware I'm suspicious of and power the box
 off hard.  If at any time I suffer database corruption from this, the
 hardware is unsuitable for database use; that should never happen. 
 This is what I mean when I say something meets the mythical
 enterprise quality.  Companies whose data is worth something can't
 operate in a situation where money has been exchanged because a
 database commit was recorded, only to lose that commit just because
 somebody tripped over the power cord and it was in the buffer rather
 than on permanent disk.  That's just not acceptable, and the even
 bigger danger of the database perhaps not coming up altogether even
 after such a tiny disaster is also very real with a volatile write cache.
Yep.  The "plug test" is part of my standard "is this stable enough for
something I care about" checkout.
 With the write cache off on these disks they still are huge wins for
 very-heavy-read applications, which many are.
 Very read-heavy applications would do better to buy a ton of RAM
 instead and just make sure they populate from permanent media (say by
 reading everything in early at sequential rates to prime the cache). 
 There is an extremely narrow use-case where SSDs are the right
 technology, and it's only in a subset even of read-heavy apps where
 they make sense.
I don't know about that in the general case - I'd say it depends.

250GB of SSD for read-nearly-always applications is a LOT cheaper than
250GB of ECC'd DRAM.  The write performance issues can be handled by
clever use of controller technology as well (that is, turn off the
drive's write cache and use the BBU on the RAID adapter.)

I have a couple of applications where two 250GB SSD disks in a Raid 1
array with a BBU'd controller, with the disk drive cache off, is all-in
a fraction of the cost of sticking 250GB of volatile storage in a server
and reading in the data set (plus managing the occasional updates) from
stable storage.  It is not as fast as stuffing the 250GB of RAM in a
machine but it's a hell of a lot faster than a big array of small
conventional drives in a setup designed for maximum IO-Ops.

One caution for those thinking of doing this - the incremental
improvement of this setup on PostgreSQL in a WRITE-SIGNIFICANT environment
isn't NEARLY as impressive.  Indeed the performance in THAT case for
many workloads may only be 20 or 30% faster than even reasonably
pedestrian rotating media in a high-performance (lots of spindles and
thus stripes) configuration, and it's more expensive (by a lot).  If you
step up to the fast SAS drives on the rotating side there's little
argument for the SSD at all (again, assuming you don't intend to cheat
and risk data loss.)

Know your application and benchmark it.

-- Karl
-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] SSD + RAID

2009-11-13 Thread Merlin Moncure
On Fri, Nov 13, 2009 at 12:22 PM, Scott Carey
sc...@richrelevance.com  On 11/13/09 7:29 AM, Merlin Moncure
mmonc...@gmail.com wrote:

 On Fri, Nov 13, 2009 at 9:48 AM, Scott Marlowe scott.marl...@gmail.com
 wrote:
 I think RAID6 is gonna reduce the throughput due to overhead to
 something far less than what a software RAID-10 would achieve.

 I was wondering about this.  I think raid 5/6 might be a better fit
 for SSD than traditional drives arrays.  Here's my thinking:

 *) flash SSD reads are cheaper than writes.  With 6 or more drives,
 less total data has to be written in Raid 5 than Raid 10.  The main
 component of raid 5 performance penalty is that for each written
 block, it has to be read first than written...incurring rotational
 latency, etc.   SSD does not have this problem.


 For random writes, RAID 5 writes as much as RAID 10 (parity + data), and
 more if the raid block size is larger than 8k.  With RAID 6 it writes 50%
 more than RAID 10.

how does raid 5 write more if the block size is > 8k? raid 10 is also
striped, so has the same problem, right?  IOW, if the block size is 8k
and you need to write 16k sequentially the raid 5 might write out 24k
(two blocks + parity).  raid 10 always writes out 2x your data in
terms of blocks (raid 5 does only in the worst case).  For a SINGLE
block, it's always 2x your data for both raid 5 and raid 10, so what I
said above was not quite correct.
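
Counting it out with a toy model (8 kB chunks, four data disks, full-stripe
writes where possible; this only counts blocks written and ignores the
extra reads RAID 5/6 need for partial-stripe updates):

CHUNK = 8 * 1024

def blocks_written(payload_bytes, level, data_disks=4):
    # chunks physically written for a sequential write of payload_bytes
    data_chunks = -(-payload_bytes // CHUNK)       # ceiling division
    if level == "raid10":
        return 2 * data_chunks                     # every chunk is mirrored
    stripes = -(-data_chunks // data_disks)        # partial stripes still need parity
    parity_per_stripe = {"raid5": 1, "raid6": 2}[level]
    return data_chunks + stripes * parity_per_stripe

for size in (8 * 1024, 16 * 1024):
    for level in ("raid10", "raid5", "raid6"):
        print("%2d kB on %s: %d chunks written"
              % (size // 1024, level, blocks_written(size, level)))

With those assumptions an 8 kB write costs 2 chunks on both RAID 5 and
RAID 10 but 3 on RAID 6 (the 50% figure above), while a 16 kB sequential
write costs 3 chunks on RAID 5 - the 24 kB case - versus 4 on RAID 10.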

raid 6 is not going to outperform raid 10 ever IMO.  It's just a
slightly safer raid 5.  I was just wondering out loud if raid 5 might
give similar performance to raid 10 on flash based disks since there
is no rotational latency.  even if it did, I probably still wouldn't
use it...

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

