Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-25 Thread Lutz Schumann
One problem with the write cache is that I do not know whether it is needed
to keep write wear (i.e. write endurance) under control.

As mentioned, disabling the write cache might be OK in terms of performance (I
want to use MLC SSDs as data disks, not as ZIL, to have an SSD-only appliance -
I'm looking for read speed for dedupe, zfs send and all the other things ZFS
tends to do a lot of random reads for).

I could not live with a degradation in write endurance from a disabled write
cache. Unfortunately nobody was able to answer this, and I guess only Intel can
-- and won't. However, I don't want to ruin two Postville SSDs at 200€ each to
find out :).
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Bob Friesenhahn

On Mon, 11 Jan 2010, Kjetil Torgrim Homme wrote:


(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)


Actually, the result is not "as expected" since the device should not 
have lost any data preceding a cache flush request.


Results like this should be cause for concern for anyone currently using one
of these devices as a ZFS log device, or using it for any write-sensitive
application at all.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Lutz Schumann
Maybe it got lost in all this text :) .. hence this re-post:

Does anyone know the impact of disabling the write cache on the write
amplification factor of the Intel SSDs?

How can I permanently disable the write cache on the Intel X25-M SSDs?

Thanks, Robert
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-11 Thread Kjetil Torgrim Homme
Lutz Schumann  writes:

> Actually the performance decrease when disabling the write cache on
> the SSD is approx. 3x (i.e. ~66%).

for this reason, you want a controller with battery backed write cache.
in practice this means a RAID controller, even if you don't use the RAID
functionality.  of course you can buy SSDs with capacitors, too, but I
think that will be more expensive, and it will restrict your choice of
model severely.

(BTW, thank you for testing forceful removal of power.  the result is as
expected, but it's good to see that theory and practice match.)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game



Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Bob Friesenhahn

On Sun, 10 Jan 2010, Lutz Schumann wrote:


Talking about read performance: assuming a reliable ZIL disk (cache
flush = working), the ZIL can guarantee data integrity; however, if the
backend disks (aka pool disks) do not properly implement cache flush, a
reliable ZIL device does not "work around" the bad backend disks, right?


(meaning: having a reliable ZIL + some MLC SSD with write cache enabled
is not reliable in the end)


As soon as there is more than one disk in the pool, it is necessary 
for cache flush to work or else the devices may contain content from 
entirely different transaction groups, resulting in a scrambled pool.


If you just had one disk which tended to ignore cache flush requests, 
then you should be ok as long as the disk writes the data in order. 
In that case any unwritten data would be lost, but the pool should not 
be lost.  If the device ignores cache flush requests and writes data 
in some random order, then the pool is likely to eventually fail.


I think that zfs mirrors should be safer than raidz when faced with 
devices which fail to flush (should be similar to the single-disk 
case), but only if there is one mirror pair.
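
For illustration, the two layouts being compared would be created roughly
like this (pool and device names are just examples):

  # single mirror pair -- should degrade like the single-disk case
  zpool create tank mirror c3t5d0 c3t6d0

  # raidz across several devices -- more exposed to devices that
  # reorder writes or ignore cache flush
  zpool create tank raidz c3t5d0 c3t6d0 c3t7d0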


A scary thing about SSDs is that they may re-write old data while 
writing new data, which could result in corruption of the old data if 
the power fails while it is being re-written.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
Actually the performance decrease when disabling the write cache on the SSD is
approx. 3x (i.e. ~66%).

Setup: 
  node1 = Linux Client with open-iscsi 
  server = comstar (cache=write through) + zvol (recordsize=8k, compression=off)
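
Roughly, the setup was created along these lines (the names, sizes and IPs
below are just examples, and I'm going from memory on the wcd property for the
write-through behaviour; note that a zvol uses volblocksize rather than
recordsize):

  # on the server: 8k zvol without compression as backing store
  zfs create -V 32G -o volblocksize=8k -o compression=off ssd/iscsivol

  # export it via COMSTAR (wcd=true = write cache disabled on the LU)
  svcadm enable stmf
  svcadm enable -r svc:/network/iscsi/target:default
  sbdadm create-lu /dev/zvol/rdsk/ssd/iscsivol
  stmfadm modify-lu -p wcd=true <lu-guid>
  stmfadm add-view <lu-guid>
  itadm create-target

  # on node1 (Linux, open-iscsi)
  iscsiadm -m discovery -t sendtargets -p <server-ip>
  iscsiadm -m node --login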

--- with SSD-Disk-write cache disabled: 

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Iozone: Performance Test of File I/O
Version $Revision: 3.327 $
Compiled for 32 bit mode.
Build: linux

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
 Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
 Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root.

Run began: Sun Jan 10 20:14:46 2010

Include fsync in write timing
Include close in write timing
Record Size 8 KB
File size set to 131072 KB
SYNC Mode.
O_DIRECT feature enabled
Command line used: iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
Output is in Kbytes/sec
Time Resolution = 0.02 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Min process = 2
Max process = 2
Throughput test with 2 processes
Each process writes a 131072 Kbyte file in 8 Kbyte records

Children see throughput for  2 initial writers  =1324.45 KB/sec
Parent sees throughput for  2 initial writers   =1291.27 KB/sec
Min throughput per process  = 646.07 KB/sec
Max throughput per process  = 678.38 KB/sec
Avg throughput per process  = 662.23 KB/sec
Min xfer=  124832.00 KB

Children see throughput for  2 rewriters=4360.29 KB/sec
Parent sees throughput for  2 rewriters =4360.08 KB/sec
Min throughput per process  =2158.82 KB/sec
Max throughput per process  =2201.47 KB/sec
Avg throughput per process  =2180.15 KB/sec
Min xfer=  128536.00 KB

Children see throughput for 2 random readers=   43930.41 KB/sec
Parent sees throughput for 2 random readers =   43914.01 KB/sec
Min throughput per process  =   21768.16 KB/sec
Max throughput per process  =   22162.25 KB/sec
Avg throughput per process  =   21965.21 KB/sec
Min xfer=  128760.00 KB

Children see throughput for 2 random writers=5561.01 KB/sec
Parent sees throughput for 2 random writers =5560.41 KB/sec
Min throughput per process  =2780.37 KB/sec
Max throughput per process  =2780.64 KB/sec
Avg throughput per process  =2780.50 KB/sec
Min xfer=  131064.00 KB

... with SSD write cache enabled 

node1:/mnt/ssd# iozone -ec -r 8k -s 128m -l 2 -i 0 -i 2 -o -I
(iozone banner and test parameters identical to the run above)

Run began: Sun Jan 10 20:22:14 2010

Children see throughput for  2 initial writers  =3387.15 KB/sec
Parent sees throughput for  2 initial writers   =3258.90 KB/sec
Min

Re: [zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
I managed to disable the write cache (I did not know of a tool on Solaris;
however, hdadm from the EON NAS binary_kit does the job):

Same power disruption test with Seagate HDD and write cache disabled ...
---

r...@nexenta:/volumes# .sc/bin/hdadm write_cache display c3t5

 c3t5 write_cache> disabled

... pull power cable of Seagate SATA Disk 
 
This is round number 4543 DONE
This is round number 4544 DONE
This is round number 4545 DONE
This is round number 4546 DONE
This is round number 4547 DONE
This is round number 4548 DONE
This is round number 4549 DONE
This is round number 4550 <... hangs here>

... power cycle everything

node1:/mnt/disk# cat testfile
This is round number 4549

... this looks good. 

So disabling the write cache helps, but it really limits performance (not for
synchronous but for async writes).

Test with Intel X25-M
--

... Same with SSD 
r...@nexenta:/volumes# hdadm write_cache off c3t5

 c3t5 write_cache> disabled

r...@nexenta:/volumes# hdadm write_cache display c3t5

 c3t5 write_cache> disabled

.. pull SSD power cable 

This is round number 9249 DONE
This is round number 9250 DONE
This is round number 9251 DONE
This is round number 9252 DONE
This is round number 9253 DONE
This is round number 9254 DONE
This is round number 9255 DONE
This is round number 9256 DONE
This is round number 9257 <... hangs here>

.. power cycle everything 
... test 

node1:/mnt/ssd# cat testfile
This is round number 9256

So without a write cache the device works correctly.

However, be warned: on boot the cache is enabled again:

Device    Serial        Vendor  Model             Rev   Temperature
--------  ------------  ------  ----------------  ----  -----------
c3t5d0p0  7200Y5160AGN  ATA     INTEL SSDSA2M160  02HD  255 C (491 F)

r...@nexenta:/volumes# hdadm write_cache display c3t5

 c3t5 write_cache> enabled

Question: Does anyone know the impact of disabling the write cache on the
write amplification factor of the Intel SSDs?

I would deploy the Intel X25-M only for "mostly read" workloads anyway, so the
performance impact of disabling the write cache can be ignored. However, if
the life expectancy of the device goes down without a write cache (I mean, it
is MLC already!) - bummer.

And another question: how can I permanently disable the write cache on the
Intel X25-M SSDs?
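
The only workaround I can think of would be to re-run hdadm from a legacy rc
script at boot. A rough sketch (assuming hdadm from the EON binary_kit is
installed under /usr/bin and the SSD stays at c3t5 - adjust both):

  #!/sbin/sh
  # /etc/rc3.d/S99ssdcache - re-disable the SSD write cache after every boot
  HDADM=/usr/bin/hdadm   # path to hdadm from the EON binary_kit
  DISK=c3t5              # controller/target id of the SSD
  $HDADM write_cache off $DISK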

Regards
-- 
This message posted from opensolaris.org


[zfs-discuss] ZFS cache flush ignored by certain devices ?

2010-01-10 Thread Lutz Schumann
A very interesting thread
(http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-fsync-write-cache-barrier-and-lost-transactions/)
and some thinking about the design of SSDs led to an experiment I did with
the Intel X25-M SSD. The question was:

Is my data safe once it has reached the disk and has been committed to my
application?

All transactional safety in ZFS requires a correct implementation of the
synchronize-cache command (see
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg27264.html, where
someone used OpenSolaris within VirtualBox, which - per default - ignores the
cache flush command). Thus qualified hardware is VERY essential (also see
http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf).
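
For the VirtualBox case: if I remember the VirtualBox manual correctly, flush
requests are honoured again once the IgnoreFlush flag is cleared for the disk
in question (the VM name and the controller/LUN below are just examples):

  VBoxManage setextradata "OpenSolarisVM" \
    "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0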

What I did (for an Intel X25-M G2 (default settings = write cache on) and a
Seagate SATA drive (ST3500418AS)):

a) Create a pool
b) Create a program that opens a file
   synchronously and writes to the file.
   It also prints the latest record written
   successfully.
c) Pull the power of the SATA disk
d) Power cycle everything
e) Open the pool again and verify that the content
   of the file is the one that has been committed
   to the application
e1) if it is the same - nice hardware
e2) if it is NOT the same - BAD hardware

What I found out was: 

Intel X25-M G2:
  - If I pull the power cable, much data is lost although committed to the
    app (some hundred records)
  - If I pull the SATA cable, no data is lost

ST3500418AS:
  - If I pull the power cable, almost no data is lost, but still the last
    write is lost (strange!)
  - If I pull the SATA cable, no data is lost

Actually this result was partially expected. However, the one missing
transaction on my SATA HDD (Seagate) is strange.

Unfortunately I do not have "enterprise SAS hardware" handy to verify that my 
test procedure is correct.

Maybe someone can run this test on a SAS test machine? (see script attached)


--- Attachments ---

--- script (call it with script.pl --file /mypool/testfile) ---

#!/usr/bin/env perl

# for O_SYNC
use Fcntl qw(:DEFAULT :flock SEEK_CUR SEEK_SET SEEK_END);
use IO::File;
use Getopt::Long;

my $pool="disk";
my $mountroot="/volumes";
my $file="$mountroot/$pool/testfile";
my $abort=0;
my $count=0;

GetOptions(
"pool=s" => \$pool,
"testfile|file=s" => \$file,
"count=i" => \$count,
);

my $dir = $file;
$dir =~ s/[^\/]+$//g;

if (-e $file) {
print "ERROR: File $file already exists\n";
exit 1;
}

if (! -d "$dir" ) {
print "ERROR: Directory $dir does not exist\n";
exit 1;
}
sysopen (FILE, "$file", O_RDWR | O_CREAT | O_EXCL | O_SYNC)
    or die "ERROR opening file $file: $!\n";

$SIG{INT}= sub { print " ... signalling Abort ... (file: $file)\n"; $abort=1; };

$|=1;

my $lastok=undef;
my $i=0;
my $msg=sprintf("This is round number %20s", $i);
# O_SYNC, O_CREAT
while (!$abort) {
$i++;

if ($count && $i>$count) { last; };

$msg=sprintf("This is round number %20s", $i);
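# rewrite the same record at offset 0; with O_SYNC the syswrite below
# should not return until the data has reached stable storage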
sysseek (FILE, 0, SEEK_SET);
print "$msg";
my $rc=syswrite FILE,$msg;
if (!defined($rc)) {
print "ERROR\n";
print "ERROR While writing $msg\n";
print "ERROR: $!\n";
last;
} else {
print " DONE \n";
$lastok=$msg;
}
}

close(FILE);

print "\nTHE LAST MESSAGE WRITTEN to file $file was:\n\n\t\"$lastok\"\n\n";

Here are the logs of my tests:

1) Test the SATA SSD (Intel X25-M) 
--
.. start write.pl

This is round number 67482
This is round number 67483
This is round number 67484
This is round number 67485
This is round number 67486
This is round number 67487
This is round number 67488
This is round number 67489
This is round number 67490

( .. I pull the POWER CABLE of the SATA SSD .. )

.. I/O hangs 

.. zpool status shows 

zpool status -v
  pool: ssd
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-JQ
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ssd         UNAVAIL      0    11     0  insufficient replicas
          c3t5d0    UNAVAIL      3     2     0  cannot open

errors: Permanent errors have been detected in the following files:

ssd:<0x0>
/volumes/ssd/
/volumes/ssd/testfile


... now I power cycled the machine and put back the power cable 

... lets see the pool status 

  pool: ssd
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ssd         ONLINE       0     0     0
          c3t5d0    ONLINE       0