Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
One problem with the write cache is that I do not know whether it is needed to limit write wear. As mentioned, disabling the write cache might be OK in terms of performance (I want to use MLC SSDs as data disks, not as ZIL, to have an SSD-only appliance -- I'm looking for read speed for dedup, zfs send, and all the other things ZFS tends to do a lot of random reads for). I could not live with a degradation in write endurance from a disabled write cache. Unfortunately nobody was able to answer this, and I guess only Intel can -- and won't. However, I don't want to ruin two Postville SSDs at 200€ each to find out :).
-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
On Mon, 11 Jan 2010, Kjetil Torgrim Homme wrote:
> (BTW, thank you for testing forceful removal of power. the result is as expected, but it's good to see that theory and practice match.)
Actually, the result is not "as expected", since the device should not have lost any data preceding a cache flush request. These sorts of results should be cause for concern for anyone currently using one as a ZFS log device, or using it for any write-sensitive application at all.

Bob
-- Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Maybe it got lost in this much text :) ... thus this re-post.

Does anyone know the impact of disabling the write cache on the write amplification factor of the Intel SSDs? How can I permanently disable the write cache on the Intel X25-M SSDs?

Thanks, Robert
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Lutz Schumann writes:
> Actually the performance decrease when disabling the write cache on the SSD is approx. 3x (i.e., about a 66% drop).
For this reason you want a controller with a battery-backed write cache. In practice this means a RAID controller, even if you don't use the RAID functionality. Of course you can buy SSDs with capacitors, too, but I think that will be more expensive, and it will restrict your choice of models severely.

(BTW, thank you for testing forceful removal of power. The result is as expected, but it's good to see that theory and practice match.)
-- Kjetil T. Homme
Redpill Linpro AS - Changing the game
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
On Sun, 10 Jan 2010, Lutz Schumann wrote:
> Talking about read performance. Assuming a reliable ZIL disk (cache flush = working): the ZIL can guarantee data integrity, however if the backend disks (aka pool disks) do not properly implement cache flush, a reliable ZIL device does not "work around" the bad backend disks, right? (Meaning: having a reliable ZIL + some MLC SSD with write cache enabled is not reliable in the end.)
As soon as there is more than one disk in the pool, it is necessary for cache flush to work, or else the devices may contain content from entirely different transaction groups, resulting in a scrambled pool. If you just had one disk which tended to ignore cache flush requests, then you should be OK as long as the disk writes the data in order. In that case any unwritten data would be lost, but the pool should not be lost. If the device ignores cache flush requests and writes data in some random order, then the pool is likely to eventually fail. I think that ZFS mirrors should be safer than raidz when faced with devices which fail to flush (it should be similar to the single-disk case), but only if there is one mirror pair. A scary thing about SSDs is that they may re-write old data while writing new data, which could result in corruption of the old data if the power fails while it is being re-written.

Bob
-- Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
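[Editorial note: not from the thread.] A rough way to probe from a client whether a device really commits synchronous writes is to time a burst of O_SYNC writes; this sketch assumes a Linux client like node1 in the posts below (oflag=sync needs GNU dd). A rotating disk that genuinely flushes each record manages on the order of 100-300 such writes per second, so tens of thousands per second suggest the writes are landing in a volatile cache.

```shell
# Write 100 records of 4k each, each committed synchronously before the
# next starts (GNU dd, oflag=sync). Compare dd's reported rate against
# what the physical medium could plausibly sustain per-flush.
dd if=/dev/zero of=syncprobe.bin bs=4k count=100 oflag=sync
ls -l syncprobe.bin   # 409600 bytes, written as 100 synchronous records
```

This only measures the sync-write rate; it cannot prove the data survives a power cut, which is why the pull-the-cable test later in this thread is still needed.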
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
Actually the performance decrease when disabling the write cache on the SSD is approx. 3x (i.e., about a 66% drop).

Setup: node1 = Linux client with open-iscsi; server = comstar (cache=write through) + zvol (recordsize=8k, compression=off)

--- with SSD disk write cache disabled:

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.327 $
	Compiled for 32 bit mode.
	Build: linux

	Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
	              Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
	              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	              Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	              Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
	              Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

	Run began: Sun Jan 10 20:14:46 2010

	Include fsync in write timing
	Include close in write timing
	Record Size 8 KB
	File size set to 131072 KB
	SYNC Mode.
	O_DIRECT feature enabled
	Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
	Output is in Kbytes/sec
	Time Resolution = 0.02 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Min process = 2
	Max process = 2
	Throughput test with 2 processes
	Each process writes a 131072 Kbyte file in 8 Kbyte records

	Children see throughput for 2 initial writers  =    1324.45 KB/sec
	Parent sees throughput for 2 initial writers   =    1291.27 KB/sec
	Min throughput per process                     =     646.07 KB/sec
	Max throughput per process                     =     678.38 KB/sec
	Avg throughput per process                     =     662.23 KB/sec
	Min xfer                                       =  124832.00 KB

	Children see throughput for 2 rewriters        =    4360.29 KB/sec
	Parent sees throughput for 2 rewriters         =    4360.08 KB/sec
	Min throughput per process                     =    2158.82 KB/sec
	Max throughput per process                     =    2201.47 KB/sec
	Avg throughput per process                     =    2180.15 KB/sec
	Min xfer                                       =  128536.00 KB

	Children see throughput for 2 random readers   =   43930.41 KB/sec
	Parent sees throughput for 2 random readers    =   43914.01 KB/sec
	Min throughput per process                     =   21768.16 KB/sec
	Max throughput per process                     =   22162.25 KB/sec
	Avg throughput per process                     =   21965.21 KB/sec
	Min xfer                                       =  128760.00 KB

	Children see throughput for 2 random writers   =    5561.01 KB/sec
	Parent sees throughput for 2 random writers    =    5560.41 KB/sec
	Min throughput per process                     =    2780.37 KB/sec
	Max throughput per process                     =    2780.64 KB/sec
	Avg throughput per process                     =    2780.50 KB/sec
	Min xfer                                       =  131064.00 KB

... with SSD write cache enabled:

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
	[iozone banner and contributor list as above]

	Run began: Sun Jan 10 20:22:14 2010

	Include fsync in write timing
	Include close in write timing
	Record Size 8 KB
	File size set to 131072 KB
	SYNC Mode.
	O_DIRECT feature enabled
	Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
	Output is in Kbytes/sec
	Time Resolution = 0.02 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Min process = 2
	Max process = 2
	Throughput test with 2 processes
	Each process writes a 131072 Kbyte file in 8 Kbyte records

	Children see throughput for 2 initial writers  =    3387.15 KB/sec
	Parent sees throughput for 2 initial writers   =    3258.90 KB/sec
	Min
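[Editorial note.] The cache-enabled run is cut off after its "initial writers" line, so only that figure is directly comparable between the two runs; a quick check of the quoted "approx 3x" using the children's throughput for 2 initial writers:

```shell
# Ratio of write-cache-on to write-cache-off throughput for the one
# workload present in both iozone runs (values copied from the logs).
cache_on=3387.15
cache_off=1324.45
awk -v on="$cache_on" -v off="$cache_off" \
    'BEGIN { printf "initial writers: %.2fx slower with write cache disabled\n", on/off }'
```

That works out to about 2.56x for initial writers; the rewriter and random-writer figures for the enabled run are lost to the truncation, so whether those reach the full 3x cannot be checked here.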
Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?
I managed to disable the write cache (I did not know of a tool on Solaris, however hdadm from the EON NAS binary_kit does the job):

Same power disruption test with the Seagate HDD and write cache disabled ...

r...@nexenta:/volumes# .sc/bin/hdadm write_cache display c3t5
c3t5 write_cache> disabled

... pull power cable of the Seagate SATA disk

This is round number 4543 DONE
This is round number 4544 DONE
This is round number 4545 DONE
This is round number 4546 DONE
This is round number 4547 DONE
This is round number 4548 DONE
This is round number 4549 DONE
This is round number 4550 <... hangs here>

... power cycle everything

node1:/mnt/disk# cat testfile
This is round number 4549

... this looks good. So disabling the write cache helps, but it really limits performance (not for synchronous, but for asynchronous writes).

Test with Intel X25-M
---------------------

Same with the SSD:

r...@nexenta:/volumes# hdadm write_cache off c3t5
c3t5 write_cache> disabled
r...@nexenta:/volumes# hdadm write_cache display c3t5
c3t5 write_cache> disabled

.. pull SSD power cable

This is round number 9249 DONE
This is round number 9250 DONE
This is round number 9251 DONE
This is round number 9252 DONE
This is round number 9253 DONE
This is round number 9254 DONE
This is round number 9255 DONE
This is round number 9256 DONE
This is round number 9257 <... hangs here>

.. power cycle everything ... test

node1:/mnt/ssd# cat testfile
This is round number 9256

So without a write cache the device works correctly. However, be warned: on boot the cache is enabled again:

Device     Serial         Vendor  Model             Rev   Temperature
------     ------         ------  -----             ---   -----------
c3t5d0p0   7200Y5160AGN   ATA     INTEL SSDSA2M160  02HD  255 C (491 F)

r...@nexenta:/volumes# hdadm write_cache display c3t5
c3t5 write_cache> enabled

Question: does anyone know the impact of disabling the write cache on the write amplification factor of the Intel SSDs? I would deploy the Intel X25-M only for "mostly read" workloads anyway, so the performance impact of disabling the write cache can be ignored. However, if the life expectancy of the device goes down without a write cache (I mean, it is MLC already!) - bummer.

And another question: how can I permanently disable the write cache on the Intel X25-M SSDs?

Regards
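[Editorial note: not from the thread.] Since the drive re-enables its write cache on every power-up, one hedged answer to the "permanently disable" question is to re-apply the disable at every boot. This is only a sketch: the hdadm path, the device name c3t5, and the `write_cache off` syntax are assumptions taken from the transcripts above (hdadm ships in the EON NAS binary_kit, not in stock Solaris).

```shell
#!/bin/sh
# Hypothetical boot-time hook, e.g. /etc/rc3.d/S99wcache on
# OpenSolaris/Nexenta (or wrapped in an SMF service): re-disable the SSD
# write cache after each reboot, since the drive re-enables it on power-up.
HDADM=/usr/bin/hdadm   # adjust to wherever the binary_kit was unpacked
DISK=c3t5              # device name from the posts above

if [ -x "$HDADM" ]; then
    "$HDADM" write_cache off "$DISK"       # disable the volatile write cache
    "$HDADM" write_cache display "$DISK"   # verify: should print "disabled"
fi
```

The `-x` guard makes the script a no-op on machines where the binary kit is not installed, so a missing tool does not break the boot sequence.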
[zfs-discuss] ZFS cache flush ignored by certain devices ?
A very interesting thread (http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/) and some thinking about the design of SSDs led to an experiment I did with the Intel X25-M SSD. The question was: is my data safe once it has reached the disk and has been committed to my application?

All transactional safety in ZFS requires a correct implementation of the synchronize-cache command (see http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg27264.html, where someone used OpenSolaris within VirtualBox, which by default ignores the cache flush command). Thus qualified hardware is VERY essential (also see http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf).

What I did (for an Intel X25-M G2 (default settings = write cache on) and a Seagate SATA drive (ST3500418AS)):
a) Create a pool
b) Create a program that opens a file synchronously and writes to it. It also prints the latest record written successfully.
c) Pull the power of the SATA disk
d) Power cycle everything
e) Open the pool again and verify that the content of the file is the last record committed to the application
e1) If it is the same - nice hardware
e2) If it is NOT the same - BAD hardware

What I found out was:

Intel X25-M G2:
- If I pull the power cable, much data is lost although committed to the app (some hundred records)
- If I pull the SATA cable, no data is lost

ST3500418AS:
- If I pull the power cable, almost no data is lost, but still the last write is lost (strange!)
- If I pull the SATA cable, no data is lost

Actually this result was partially expected. However, the one missing transaction on my SATA HDD (Seagate) is strange. Unfortunately I do not have "enterprise SAS hardware" handy to verify that my test procedure is correct. Maybe someone can run this test on a SAS test machine?
(see script attached)

--- Attachments ---

--- script (call it with script.pl --file /mypool/testfile) ---

#!/usr/bin/env perl
# Writes a numbered record to a synchronously-opened (O_SYNC) file in a loop
# and prints each record as it is committed, so that after a power cut the
# file contents can be compared against the last record reported as DONE.
use Fcntl qw(:DEFAULT :flock SEEK_CUR SEEK_SET SEEK_END);   # for O_SYNC
use IO::File;
use Getopt::Long;

my $pool      = "disk";
my $mountroot = "/volumes";
my $file;
my $abort     = 0;
my $count     = 0;

GetOptions(
    "pool=s"          => \$pool,
    "testfile|file=s" => \$file,
    "count=i"         => \$count,
);

# default path derived from --pool; computed after option parsing so that
# --pool actually takes effect
$file ||= "$mountroot/$pool/testfile";

my $dir = $file;
$dir =~ s/[^\/]+$//g;

if (-e $file) {
    print "ERROR: File $file already exists\n";
    exit 1;
}
if (! -d $dir) {
    print "ERROR: Directory $dir does not exist\n";
    exit 1;
}

sysopen(FILE, $file, O_RDWR | O_CREAT | O_EXCL | O_SYNC)
    or die "ERROR opening file $file: $!\n";

$SIG{INT} = sub {
    print " ... signalling abort ... (file: $file)\n";
    $abort = 1;
};

$| = 1;                 # unbuffered output
my $lastok = undef;
my $i      = 0;
my $msg;

while (!$abort) {
    $i++;
    last if ($count && $i > $count);
    $msg = sprintf("This is round number %20s", $i);
    sysseek(FILE, 0, SEEK_SET);   # overwrite the same record every round
    print $msg;
    my $rc = syswrite(FILE, $msg);
    if (!defined($rc)) {
        print "ERROR\n";
        print "ERROR while writing $msg\n";
        print "ERROR: $!\n";
        last;
    } else {
        print " DONE \n";
        $lastok = $msg;
    }
}
close(FILE);

print "\nTHE LAST MESSAGE WRITTEN to file $file was:\n\n\t\"$lastok\"\n\n";

Here are the logs of my tests:

1) Test the SATA SSD (Intel X25-M)
----------------------------------

.. start write.pl

This is round number 67482
This is round number 67483
This is round number 67484
This is round number 67485
This is round number 67486
This is round number 67487
This is round number 67488
This is round number 67489
This is round number 67490

( .. I pull the POWER CABLE of the SATA SSD .. )

.. I/O hangs .. zpool status shows:

zpool status -v
  pool: ssd
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-JQ
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ssd         UNAVAIL       0    11     0  insufficient replicas
          c3t5d0    UNAVAIL       3     2     0  cannot open

errors: Permanent errors have been detected in the following files:

        ssd:<0x0>
        /volumes/ssd/
        /volumes/ssd/testfile

... now I power cycled the machine and put back the power cable ... let's see the pool status:

  pool: ssd
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ssd         ONLINE       0     0     0
          c3t5d0    ONLINE       0