Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi list,

> If you're running solaris proper, you better mirror your ZIL log
> device. ... I plan to get to test this as well, won't be until late
> next week though.

Running OSOL nv130. Powered the machine off, removed the F20 and powered
back on. The machine boots OK and comes up normally, with the following
in 'zpool status':

...
  pool: mypool
 state: FAULTED
status: An intent log record could not be read.
        Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME      STATE    READ WRITE CKSUM
        mypool    FAULTED     0     0     0  bad intent log
...

Nice! Running a later version of ZFS seems to lessen the need for
ZIL-mirroring...

With kind regards,

Jeroen
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Roch,

> Can you try 4 concurrent tars to four different ZFS filesystems (same
> pool)?

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems, total time goes down to
around 4min (IOPS maxes out at around 5500 when adding all four vmods
together). When using four file systems, total time goes down to around
3min30s (IOPS maxing out at about 9500).

I figured it is either NFS or a per-filesystem data structure in the
ZFS/ZIL interface. To rule out NFS I tried exporting two directories
using default NFS shares (via /etc/dfs/dfstab entries). To my surprise
this seems to bypass the ZIL altogether (dropping to 100 IOPS, which
results from our RAIDZ2 configuration). So clearly ZFS sharenfs is more
than a nice front end for NFS configuration :).

But back to your suggestion: you clearly had a hypothesis behind your
question. Care to elaborate?

With kind regards,

Jeroen
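For reference, the concurrent-untar comparison can be sketched roughly
like this (the pool mountpoint, filesystem names and tarball path are
assumptions for illustration, not our exact test setup):

```shell
#!/bin/sh
# run_untar_test ROOT TARBALL FS...: untar TARBALL into each filesystem
# under ROOT concurrently, so the measured wall time covers all streams.
run_untar_test() {
    root=$1; tarball=$2; shift 2
    for fs in "$@"; do
        ( cd "$root/$fs" && tar xf "$tarball" ) &   # one stream per fs
    done
    wait    # block until every parallel untar has finished
}

# e.g. against four filesystems of one pool mounted under /mypool:
# time run_untar_test /mypool /var/tmp/eclipse.tar fs1 fs2 fs3 fs4
```

Comparing the wall time for two versus four target filesystems is what
produced the 4min-vs-3min30s figures above.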
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Al,

> Have you tried the DDRdrive from Christopher George
> cgeo...@ddrdrive.com? Looks to me like a much better fit for your
> application than the F20? It would not hurt to check it out. Looks to
> me like you need a product with low *latency* - and a RAM based cache
> would be a much better performer than any solution based solely on
> flash. Let us know (on the list) how this works out for you.

Well, I did look at it, but at that time there was no Solaris support
yet. Right now it seems there is only a beta driver? I seem to remember
that if you want reliable fallback to NVRAM, you need a UPS feeding the
card. I could be very wrong there, but the product documentation isn't
very clear on this (at least to me ;) ). Also, we'd kind of like to
have a SnOracle-supported option.

But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Casper,

:-) Nice to see your jet still reaches as far as ever :-)

> I'm happy to see that it is now the default and I hope this will
> cause the Linux NFS client implementation to be faster for conforming
> NFS servers.

The interesting thing is that apparently the defaults on Solaris and
Linux are chosen such that one can't signal the desired behaviour to
the other. At least we didn't manage to get a Linux client to
asynchronously mount a Solaris (ZFS-backed) NFS export...

Anyway, we seem to be getting off topic here :-) The thread was started
to get insight into the behaviour of the F20 as ZIL. _My_ particular
interest would be to be able to answer why performance doesn't seem to
scale up when adding vmods...

With kind regards,

Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> It doesn't have to be F20. You could use the Intel X25 for example.

The MLC-based disks are bound to be too slow (we tested with an OCZ
Vertex Turbo). So you're stuck with the X25-E (which Sun stopped
supporting for some reason). I believe most normal SSDs do have some
sort of cache and usually no supercap or other backup power solution,
so be wary of that. Having said all this, the new SandForce-based SSDs
look promising...

> If you're running solaris proper, you better mirror your ZIL log
> device.

Absolutely true. I forgot this 'cause we're running OSOL nv130... (we
constantly seem to need features that haven't landed in Solaris proper
:) ).

> If you're running opensolaris ... I don't know if that's important.

At least I can confirm the ability to add and remove ZIL devices on the
fly with OSOL of a sufficiently recent build.

> I'll probably test it, just to be sure, but I might never get around
> to it because I don't have a justifiable business reason to build the
> opensolaris machine just for this one little test.

I plan to get to test this as well, won't be until late next week
though.

With kind regards,

Jeroen
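For the record, adding and removing a slog on the fly looks like this
(device names are hypothetical, and 'zpool remove' of a log device only
works on sufficiently recent builds):

```
# Attach a single (unmirrored) slog to an existing pool:
zpool add mypool log c2t0d0

# Or attach it mirrored from the start:
zpool add mypool log mirror c2t0d0 c2t1d0

# Remove the log device again (recent OpenSolaris builds only):
zpool remove mypool c2t0d0

# Verify the logs section of the pool layout:
zpool status mypool
```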
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Oh, one more comment. If you don't mirror your ZIL, and your
> unmirrored SSD goes bad, you lose your whole pool. Or at least suffer
> data corruption.

Hmmm, I thought that in that case ZFS reverts to the regular on-disk
ZIL?

With kind regards,

Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> The write cache is _not_ being disabled. The write cache is being
> marked as non-volatile.

Of course you're right :) Please filter my postings with a
sed 's/write cache/write cache flush/g' ;)

> BTW, why is a Sun/Oracle branded product not properly respecting the
> NV bit in the cache flush command? This seems remarkably broken, and
> leads to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with emulating disk semantics over
PCIe. Anyway, this did have us stumped in the beginning: performance
wasn't better than with an OCZ Vertex Turbo ;)

By the way, the URL to the reference is part of the official F20
product documentation (that's how we found it in the first place)...

With kind regards,

Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Karsten,

> But is this mode of operation *really* safe?

As far as I can tell it is:

- The F20 uses some form of power backup that should provide power to
  the interface card long enough to get the cache onto solid state in
  case of power failure.
- Recollecting from earlier threads here: in case the card fails (but
  not the host), there should be enough data residing in memory for ZFS
  to safely switch to the regular on-disk ZIL.
- According to my contacts at Sun, the F20 is a viable replacement for
  the X25-E.
- Switching write cache flushing off seems to be officially recommended
  on the Sun performance wiki (translated to more sane defaults).

If I'm wrong here I'd like to know too, 'cause this is probably the way
we're taking it into production. :)

With kind regards,

Jeroen
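The "more sane defaults" tuning referred to above is typically applied
as an /etc/system fragment; a sketch, assuming the tunable name is
unchanged in your build, and only safe when every affected device
really does have a non-volatile write cache:

```
* /etc/system fragment: stop ZFS from issuing cache-flush commands.
* ONLY safe when all pool/slog devices have battery- or supercap-backed
* (non-volatile) write caches, as is claimed for the F20.
set zfs:zfs_nocacheflush = 1
```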
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Richard,

> For this case, what is the average latency to the F20?

I'm not giving the average since I only performed a single run here
(still need to get autopilot set up :) ). However, here is a graph of
iostat IOPS/svc_t sampled in 10s intervals during a run of untarring an
eclipse tarball 40 times from two hosts. I'm using 1 vmod here:

http://www.science.uva.nl/~jeroen/zil_1slog_e1000_iostat_iops_svc_t_10sec_interval.pdf

Maximum svc_t is around 2.7ms, averaged over 10s.

Still wondering why this won't scale out, though. We don't seem to be
CPU bound, unless ZFS limits itself to max 30% cputime?

With kind regards,

Jeroen
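The sampling behind those graphs can be reproduced with something like
the following (the slog device name is an assumption; column positions
are per Solaris 'iostat -xn' output):

```
# Sample extended device statistics every 10 seconds:
#   -x extended statistics, -n descriptive device names.
# For the slog device, print total IOPS (r/s + w/s) and asvc_t:
iostat -xn 10 | awk '/c3t0d0/ { print $1 + $2, $8 }'
```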
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> If you are going to trick the system into thinking a volatile cache
> is nonvolatile, you might as well disable the ZIL -- the data
> corruption potential is the same.

I'm sorry? I believe the F20 has a supercap or the like?

The advice on:

http://wikis.sun.com/display/Performance/Tuning+ZFS+for+the+F5100#TuningZFSfortheF5100-ZFSF5100

is to disable write cache flushing altogether. We opted not to do
_that_ though... :)

Are you sure that disabling the write cache flush on the F20 is a bad
thing to do?

With kind regards,

Jeroen
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Hi Freddie, list,

> Option 4 is to re-do your pool, using fewer disks per raidz2 vdev,
> giving more vdevs to the pool, and thus increasing the IOps for the
> whole pool. 14 disks in a single raidz2 vdev is going to give
> horrible IO, regardless of how fast the individual disks are. Redoing
> it with 6-disk raidz2 vdevs, or even 8-drive raidz2 vdevs will give
> you much better throughput.

We are aware that the configuration is possibly suboptimal. However,
before we had the SSDs, we did test with 6x7 Z2 and even two-way
mirror-set setups. These gave better IOPS, but not a significant enough
improvement (I would expect roughly a bit more than double the
performance in 14x3 vs 6x7). In the end it is indeed a choice between
performance, space and security. Our hope is that the SSD slogs
serialise the data flow enough to make this work. But you have a fair
point and we will also look into the combination of SSDs and pool
configurations.

Also, possibly the Vertex Turbo SSDs aren't as good latency-wise as I
expected. Maybe the Sun SSDs will do a lot better. We will find out
when they arrive (due somewhere in February).

With kind regards,

Jeroen

--
Jeroen Roodhart
IT Consultant
University of Amsterdam
j.r.roodh...@uva.nl
Informatiseringscentrum
Tel. 020 525 7203
--
See http://www.science.uva.nl/~jeroen for openPGP public key
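The vdev-count tradeoff can be put in back-of-envelope numbers
(per-disk IOPS is an assumed round figure; the point is that random
IOPS of a raidz pool scales with the number of vdevs, not the number of
disks):

```shell
#!/bin/sh
# 42 data disks carved into raidz2 vdevs of different widths:
# each vdev delivers roughly one disk's worth of random IOPS.
disks=42
per_disk_iops=100   # assumed 7200rpm-class figure
for width in 14 7 6; do
    vdevs=$((disks / width))
    echo "width $width: $vdevs vdevs, ~$((vdevs * per_disk_iops)) IOPS"
done
```

So 14x3 to 6x7 is roughly a doubling, matching the "bit more than
double" we measured.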
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Mattias Pantzare wrote:
> The ZIL is _not_ optional, as the log is in UFS.

Right, thanks (also to Richard and Daniel) for the explanation. I was
afraid this was too good to be true; nice to see it stated this clearly
though.

That would leave us with three options:

1) Deal with it and accept performance as it is.
2) Find a way to speed things up further for this workload.
3) Stop trying to use ZFS for this workload.

Option 1 is not going to be feasible, so we're left with 2 and 3. We
will have to do some more benchmarks in the new year. Maybe trying
different NFS wsizes results in different figures. Also, we'll look at
UFS on the Thor, although I am not looking forward to handling large
amounts of data on anything other than ZFS. Spoiled for life, probably
:)

In the mean time, if any of you have time to look at our iozone data
and spot glaring mistakes, we would definitely appreciate your
comments.

Thanks for your help,

With kind regards,

Jeroen
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Jeroen Roodhart wrote:
> > Questions: 1. Client wsize?
> We usually set these to 342768 but this was tested with CentOS
> defaults: 8192 (we're doing this over NFSv3)

I stand corrected here. Looking at /proc/mounts I see we are in fact
using different values:

...
10.0.0.1:/mypool/test_FS /nfs nfs rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=10.0.0.1 0 0
...

So wsize was 1048576 during the iozone tests. That'll teach me to rely
on manuals :)

So repeating these tests with different wsizes seems to be a smart
thing to do.

With kind regards,

Jeroen
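A quick way to pull the negotiated wsize out of /proc/mounts on a Linux
client (the helper name is mine, the mountpoint /nfs is from the entry
above):

```shell
#!/bin/sh
# nfs_wsize MOUNTS-FILE MOUNTPOINT: print the wsize negotiated for an
# NFS mount, by splitting the options field of the mounts table.
nfs_wsize() {
    awk -v mp="$2" '$2 == mp { print $4 }' "$1" |
        tr ',' '\n' | sed -n 's/^wsize=//p'
}

# on a live client:  nfs_wsize /proc/mounts /nfs
# then remount with an explicit value to test others, e.g.:
#   mount -o remount,wsize=32768 /nfs
```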
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Hi Richard,

Richard Elling wrote:
> How about posting the data somewhere we can see it?

As stated in an earlier posting, it should be accessible at:

http://init.science.uva.nl/~jeroen/solaris11_iozone_nfs2zfs

Happy holidays!

~Jeroen
Re: [zfs-discuss] Benchmarks results for ZFS + NFS, using SSD's as slog devices (ZIL)
Hi Richard, ZFS-discuss,

> On Dec 23, 2009, at 8:27 AM, Auke Folkerts wrote:

Filling in for Auke here.

> > The raw data as well as the graphs that I created are available on
> > request, should people be interested.
> Yes, can you post somewhere?

I've put the results here; tests are run under nv129:

http://www.science.uva.nl/~jeroen/solaris11_iozone_nfs2zfs

Original measurements (with iozone headers) are in:

http://www.science.uva.nl/~jeroen/solaris11_iozone_nfs2zfs/originals/

> Questions:
> 1. Client wsize?

We usually set these to 342768, but this was tested with CentOS
defaults: 8192 (we're doing this over NFSv3).

> 2. Client NFS version?

NFSv3 (earlier tests show about 15% improvement using v4, but we still
use v3 in production).

> 3. logbias settings?

"Throughput" for runs labeled throughput, otherwise defaults.

> 4. Did you test with a Solaris NFS client? If not, why not?

We didn't, because our production environment consists of Solaris
servers and Linux/MS Windows clients.

> UFS is a totally different issue, sync writes are always sync'ed. I
> don't work for Sun, but it would be unusual for a company to accept
> willful negligence as a policy. Ambulance chasing lawyers love that
> kind of thing.

The Thor replaces a geriatric Enterprise system running Solaris 8 over
UFS. For these workloads it beats the pants off our current setup, and
somehow the "but you're safer now" argument doesn't go over very well
:)

We are under the impression that a setup that serves NFS over UFS has
the same assurance level as a setup using ZFS without ZIL. Is this
impression false? If it isn't, then offering a tradeoff between "the
same assurance level as you are used to, with better performance" and
"a better assurance level, but with significant performance hits for
random IO" doesn't seem too wrong to me. In the first case you still
have the ZFS guarantees once data is on disk...

Thanks in advance for your insights,

With kind regards,

Jeroen
Re: [zfs-discuss] SNV_125 MPT warning in logfile
> How did your migration to ESXi go? Are you using it on the same
> hardware or did you just switch that server to an NFS server and run
> the VMs on another box?

The latter: we run these VMs over NFS anyway, had ESXi boxes under test
already, and were already separating data exports from VM exports. We
use an in-house developed configuration management/bare-metal system
which allows us to install new machines pretty easily. In this case we
just provisioned the ESXi VMs to new VM exports on the Thor whilst
re-using the data exports as they were...

Works pretty well, although the Sun x1027A 10G NICs aren't yet
supported under ESXi 4...
Re: [zfs-discuss] SNV_125 MPT warning in logfile
> I'm running nv126 XvM right now. I haven't tried it without XvM.

Without XvM we do not see these issues. We're running the VMs through
NFS now (using ESXi)...
Re: [zfs-discuss] SNV_125 MPT warning in logfile
We see the same issue on an x4540 Thor system with 500G disks. Lots of:

...
Nov 3 16:41:46 uva.nl scsi: [ID 107833 kern.warning] WARNING: /p...@3c,0/pci10de,3...@f/pci1000,1...@0 (mpt5):
Nov 3 16:41:46 encore.science.uva.nl Disconnected command timeout for Target 7
...

This system is running nv125 XvM. It seems to occur more when we are
using VMs. This of course causes very long interruptions on the VMs as
well...