[zfs-discuss] ZFS, XFS, and EXT4 compared
I have a lot of people whispering zfs in my virtual ear these days, and at the same time I have an irrational attachment to xfs based entirely on its lack of the 32000-subdirectory limit. I'm not afraid of ext4's newness, since really a lot of that stuff has been in Lustre for years. So a-benchmarking I went. Results at the bottom: http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html

Short version: ext4 is awesome. zfs has absurdly fast metadata operations but falls apart on sequential transfer. xfs has great sequential transfer but really bad metadata ops, like 3 minutes to tar up the kernel.

It would be nice if mke2fs would copy xfs's code for optimal layout on a software raid; the mkfs defaults and the mdadm defaults interact badly. Postmark is a somewhat bogus benchmark with some obvious quantization problems.

Regards, jwb
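(For reference, mke2fs can at least be told the RAID geometry by hand instead of relying on the defaults. A sketch, assuming a hypothetical 6-disk md RAID-0 at /dev/md0 with 64 KiB chunks and 4 KiB filesystem blocks; the exact option spelling varies by e2fsprogs version:

  # stride = chunk / block = 64K / 4K = 16
  # stripe-width = stride * data disks = 16 * 6 = 96
  mke2fs -b 4096 -E stride=16,stripe-width=96 /dev/md0
)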
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
Jeffrey, it would be interesting to see your zpool layout info as well. It can significantly influence the results obtained in the benchmarks.

On 8/30/07, Jeffrey W. Baker [EMAIL PROTECTED] wrote: [...]

-- Regards, Cyril
[zfs-discuss] Samba with ZFS ACL
Hi, I'm looking for a Samba that works with native ZFS ACLs. With ZFS almost everything works except the native ZFS ACLs. I have learned on the samba mailing list that this won't work until samba-3.2.0 is released. Does anyone know of a way to make it work with samba-3.0.25? If you have any ideas, please let me know. Thanks.
[zfs-discuss] DMU as general purpose transaction engine?
ZFS experts, is it possible to use the DMU as a general-purpose transaction engine? More specifically, in the following order:

1. Create the transaction:
   tx = dmu_tx_create(os);
   error = dmu_tx_assign(tx, TXG_WAIT);

2. Decide what to modify (say, create a new object):
   dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
   dmu_tx_hold_bonus(tx, dzp->z_id);
   dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
   ...

3. Commit the transaction:
   dmu_tx_commit(tx);

The reason I am asking for this particular order is that I may not know the intent of the transaction until late in the process. If it is not possible, can I at least declare that the transaction is going to change N objects (without specifying each object) and that each change is M blocks at most (without specifying object and offset)? If yes, how?

Thanks, -Atul

-- Atul Vidwansa, Cluster File Systems Inc. http://www.clusterfs.com
Re: [zfs-discuss] Samba with ZFS ACL
Please read this thread on my blog: http://blogs.sun.com/timthomas/entry/samba_and_swat_in_solaris. This question has been addressed in the comments.

Yoshikuni Yanagiya said the following: [...]

-- Tim Thomas, Storage Systems Product Group, Sun Microsystems, Inc. Internal Extension: x(70)18097 Office Direct Dial: +44-161-905-8097 Mobile: +44-7802-212-209 Email: [EMAIL PROTECTED]
Re: [zfs-discuss] import zfs dataset online in a zone
Works like a charm! Thank you very much, Darren!

Greetings, Stoyan
[zfs-discuss] ZFS compared to ASM
Hello ZFS folks, with the deployment of HA-ZFS we are in the process of migrating some SVM-based cluster services to ZFS. What we couldn't find is an up-to-date performance comparison between Oracle's own ASM technology and ZFS; I could only find performance figures from 2006 (http://blogs.sun.com/roch/ , http://blogs.sun.com/realneel/entry/zfs_and_databases).

Should we give up on this one and let the Oracle database clusters remain on SVM/raw devices? (Depending on the performance difference, we might just convince the customer using the extra features zfs can provide.)

Can zfs provide raw access to disks, in case the client would still insist on ASM? I have found a blog where they say zfs does not provide raw disk access at all, but what about 'zfs create -V', which creates a raw device for swap?

Thanks, CSaba

-- Csaba Balogh, Mission Critical Support Engineer, Sun Microsystems Austria, Wienerbergstrasse 3/VII, A-1100 Wien. Mobile: +43 664 6056311933, Direct Tel: +431605 6311933 (x60933), Email: [EMAIL PROTECTED]
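(For reference, 'zfs create -V' makes an emulated volume (zvol) that is exposed as both a block and a raw character device. A sketch, assuming a hypothetical pool named 'tank' and volume name 'oravol1':

  # create an 8 GB zvol
  zfs create -V 8g tank/oravol1

  # block device under dsk, raw device under rdsk
  ls -l /dev/zvol/dsk/tank/oravol1 /dev/zvol/rdsk/tank/oravol1

  # e.g. usable as a swap device
  swap -a /dev/zvol/dsk/tank/oravol1
)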
Re: [zfs-discuss] [zfs-code] DMU as general purpose transaction engine?
I am not an expert, but I think the correct sequence is:

1. dmu_tx_create()
2. dmu_tx_hold_*()
3. dmu_tx_assign()
4. modify the objects as part of the transaction
5. dmu_tx_commit()

See the comments in common/fs/zfs/sys/dmu.h.

Thanks, Bhaskar
[zfs-discuss] problem: file copies aren't getting the current file
I'm not sure if this is a zfs, zones, or solaris/nfs problem, so I'll start on this alias.

Problem: I am seeing file copies from one machine to another grab an older file. (Worded differently: the cp command is not getting the most recent file.)

For instance, on a T2000, Solaris 10u3, with zfs set up, and a zone, I try to copy in a file from my swan home directory to a directory in the zone... The file copied is not the file currently in my home directory; it is an older version of it. I've suspected this for some time (months), but today was the first time I could actually see it happen. The niagara box seems to pull the file from some cache, but where?

Thanks in advance for any pointers or configuration advice. This is wreaking havoc on my testing.

Russ
Re: [zfs-discuss] problem: file copies aren't getting the current file
On Thu, Aug 30, 2007 at 10:18:05AM -0700, Russ Petruzzelli wrote:
> I try to copy in a file from my swan home directory to a directory in the zone ...

Presumably that means NFS mounted, and that the actual FS is not a ZFS.

> The niagara box seems to pull the file from some cache, but where?

NFS.

> Thanks in advance for any pointers or configuration advice. This is wreaking havoc on my testing.

It's probably nothing to do with ZFS.
Re: [zfs-discuss] problem: file copies aren't getting the current file
NFS clients can cache. This cache can be loosely synchronized for performance reasons. See the settings for actimeo and related variables in mount_nfs(1m).
-- richard

Russ Petruzzelli wrote: [...]
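(For reference, the attribute-cache windows from mount_nfs(1m) can be shortened or disabled per mount. A sketch, assuming a hypothetical server 'swan' exporting /export/home:

  # disable attribute caching entirely (safe but slow)
  mount -o actimeo=0 swan:/export/home /mnt/home

  # or shorten the attribute-cache windows to 1 second
  mount -o acregmin=1,acregmax=1,acdirmin=1,acdirmax=1 swan:/export/home /mnt/home
)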
Re: [zfs-discuss] problem: file copies aren't getting the current file
On Aug 30, 2007, at 12:35 PM, Richard Elling wrote:
> NFS clients can cache. This cache can be loosely synchronized for performance reasons. See the settings for actimeo and related variables in mount_nfs(1m)

The NFS client will getattr/OPEN at the point where the application opens the file (close-to-open consistency), and actimeo will not change that behavior. The nocto mount option will disable that.

If the client is copying an older version of the file, then the client is either not checking the file's modification time correctly or the NFS server is not telling the truth.

Spencer
Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts
> No. You can neither access ZFS nor UFS in that way. Only one host can mount the file system at the same time (read/write or read-only doesn't matter here). [...] If you don't want to use NFS, you can use QFS in such a configuration. The shared-writer approach of QFS allows mounting the same file system on different hosts at the same time.

Thank you. We had been using multiple read-only UFS mounts and one R/W mount as a poor man's technique to move data between SAN-connected hosts. Based on your discussion, this appears to be a Really Bad Idea[tm].

That said, is there a HOWTO anywhere on installing QFS on Solaris 9 (Sparc64) machines? Is that even possible?
[zfs-discuss] OT: QFS question WAS: Single SAN Lun presented to 4 Hosts
On 8/30/07, Peter L. Thomas [EMAIL PROTECTED] wrote:
> That said, is there a HOWTO anywhere on installing QFS on Solaris 9 (Sparc64) machines? Is that even possible?

I don't know of a HOWTO, but I assume the manual has instructions. When I took the Sun SAM-FS / QFS technical training many years ago, they were supported on Solaris 2.6, 7, and 8 (which tells you how long ago that was), so I assume Solaris 9 is (or was) supported.

-- Paul Kraus
Re: [zfs-discuss] problem: file copies aren't getting the current file
On 8/30/07, Russ Petruzzelli [EMAIL PROTECTED] wrote:
> For instance, on a T2000, Solaris 10u3, with zfs setup, and a zone, I try to copy in a file from my swan home directory to a directory in the zone ... The file copied is not the file currently in my home directory. It is an older version of it.

Assuming (as others have said) this is via an NFS mount, I have seen this, but only with very recently modified files. Waiting 5 seconds seems to be enough to ensure that the current file is read and not the older (cached) version. This was with Solaris 10 on both ends and UFS shared via NFS. Definitely not a ZFS issue :-)

-- Paul Kraus
Re: [zfs-discuss] OT: QFS question WAS: Single SAN Lun presented to 4 Hosts
On Thu, 2007-08-30 at 14:03 -0400, Paul Kraus wrote: [...]

As Paul mentions, SAMQ will definitely run on S9. One source for the docs is here:
http://docs.sun.com/app/docs/prod/samfs?l=en#hic
http://docs.sun.com/app/docs/prod/qfs?l=en#hic
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
I'll take a look at this. ZFS provides outstanding sequential IO performance (both read and write). In my testing, I can essentially sustain hardware speeds with ZFS on sequential loads. That is, assuming 30-60MB/sec per disk of sequential IO capability (depending on hitting inner or outer cylinders), I get linear scale-up on sequential loads as I add disks to a zpool; e.g., I can sustain 250-300MB/sec on a 6-disk zpool, and it's pretty consistent for raidz and raidz2.

Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what was measured on the other 2 file systems for the same test. Very odd. Still looking...

Thanks, /jim

Jeffrey W. Baker wrote: [...]
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Thu, 2007-08-30 at 14:33 -0400, Jim Mauro wrote:
> Your numbers are in the 50-90MB/second range, or roughly 1/2 to 1/4 of what was measured on the other 2 file systems for the same test. Very odd.

Yeah, it's pretty odd. I'd tend to blame the Areca HBA, but then I'd also point out that the HBA is Verified by Sun.

-jwb
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Thu, 2007-08-30 at 08:37 -0500, Jose R. Santos wrote:
> On Wed, 29 Aug 2007 23:16:51 -0700, Jeffrey W. Baker [EMAIL PROTECTED] wrote:
> > http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html
>
> FFSB: Could you send the patch to fix the FFSB Solaris build? I should probably update the Sourceforge version so that it builds out of the box.

Sadly I blew away OpenSolaris without preserving the patch, but the gist of it is this: ctime_r takes three parameters on Solaris (the third is the buffer length), and Solaris has directio(3c) instead of O_DIRECT.

> I'm also curious about your choices in the FFSB profiles you created, specifically the very short run time and doing fsync after every file close. When using FFSB, I usually run with a large run time (usually 600 seconds) to make sure that we do enough IO to get a stable result.

With a 1GB machine and max I/O of 200MB/s, I assumed 30 seconds would be enough for the machine to quiesce. You disagree? The fsync flag is in there because my primary workload is PostgreSQL, which is entirely synchronous.

> Running longer means that we also use more of the disk storage and our results are not based on doing IO to just the beginning of the disk. When running for that long a period of time, the fsync flag is not required since we do enough reads and writes to cause memory pressure and guarantee IO going to disk. Nothing wrong in what you did, but I wonder how it would affect the results of these runs.

So do I :) I did want to finish the test in a practical amount of time, and it takes 4 hours for the RAID to build. I will do a few hours-long runs of ffsb with Ext4 and see what it looks like.

> The agefs options you use are also interesting, since you only utilize a very small percentage of your filesystem. Also note that since the create and append weights are very heavy compared to deletes, the desired utilization would be reached very quickly and without that much fragmentation. Again, nothing wrong here, just very interested in your perspective in selecting these settings for your profile.

The aging takes forever, as you are no doubt already aware. It requires at least 1 minute for 1% utilization. On a longer run, I can do more aging. The create and append weights are taken from the README.

> Don't mean to invalidate the Postmark results, just merely pointing out a possible error in the assessment of the meta-data performance of ZFS. I say possible since it's still unknown if another workload will be able to validate these results.

I don't want to pile scorn on XFS, but the postmark workload was chosen for a reasonable run time on XFS, and then it turned out that it runs in 1-2 seconds on the other filesystems. The scaling factors could have been better chosen to exercise the high speeds of Ext4 and ZFS. The test needs to run for more than a minute to get meaningful results from postmark, since it uses truncated whole-number seconds as the denominator when reporting.

One thing that stood out from the postmark results is how ext4/sw has a weird inverse scaling with respect to the number of subdirectories: it's faster with 10000 files in 1 directory than with 100 files each in 100 subdirectories. Odd, no?

> Did you gather CPU statistics when running these benchmarks?

I didn't bother. If you buy a server these days and it has fewer than four CPUs, you got ripped off.

-jwb
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Aug 29, 2007, at 11:16 PM, Jeffrey W. Baker wrote: [...]

Hey jwb,

Thanks for taking up the task. It's benchmarking, so i've got some questions...

What does it mean to have an external vs. internal journal for ZFS?

Can you show the output of 'zpool status' when using software RAID vs. hardware RAID for ZFS?

The hardware RAID has a cache on the controller. ZFS will flush the cache when pushing out a txg (essentially before and after writing out the uberblock). When you have a non-volatile cache with battery backing (such as your setup), it's safe to disable that by putting 'set zfs:zfs_nocacheflush = 1' in /etc/system and rebooting. It's ugly, but we're going through the final code review of a fix for this (it's partly that we aren't sending down the right command, and partly that even if we did, no storage devices actually support it quite yet).

What parameters did you give bonnie++? Compiled 64-bit, right?

For the randomio test, it looks like you used an io_size of 4KB. Are those aligned? Random? How big is the '/dev/sdb' file?

Do you have the parameters given to FFSB?

eric
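(For reference, the workaround eric describes is a one-line /etc/system change -- only safe when the controller cache really is non-volatile/battery-backed:

  # append to /etc/system, then reboot
  set zfs:zfs_nocacheflush = 1
)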
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Thu, 2007-08-30 at 12:07 -0700, eric kustarz wrote:
> What does it mean to have an external vs. internal journal for ZFS?

This is my first use of ZFS, so be gentle. External == ZIL on a separate device, e.g.:

# zpool create tank c2t0d0 log c2t1d0

> Can you show the output of 'zpool status' when using software RAID vs. hardware RAID for ZFS?

I blew away the hardware RAID, but here's the one for software:

# zpool status
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
        logs        ONLINE       0     0     0
          c2t6d0    ONLINE       0     0     0

errors: No known data errors

iostat shows balanced reads and writes across t[0-5], so I assume this is working.

> The hardware RAID has a cache on the controller. ZFS will flush the cache when pushing out a txg [...] it's safe to disable that by putting 'set zfs:zfs_nocacheflush = 1' in /etc/system and rebooting.

Do you think this would matter? There's no reason to believe that the RAID controller respects the flush commands, is there? As far as the operating system is concerned, the flush means that data is in non-volatile storage, and the RAID controller's cache/disk configuration is opaque.

> What parameters did you give bonnie++? Compiled 64-bit, right?

Uh, whoops. As I freely admit this is my first encounter with opensolaris, I just built the software on the assumption that it would be 64-bit by default. But it looks like all my benchmarks were built 32-bit. Yow. I'd better redo them with -m64, eh?

[time passes]

Well, results are _substantially_ worse with bonnie++ recompiled at 64-bit. Way, way worse: 54MB/s linear reads, 23MB/s linear writes, 33MB/s mixed.

> For the randomio test, it looks like you used an io_size of 4KB. Are those aligned? Random? How big is the '/dev/sdb' file?

Randomio does aligned reads and writes. I'm not sure what you mean by /dev/sdb? The file upon which randomio operates is 4GiB.

> Do you have the parameters given to FFSB?

The parameters are linked on my page.

Regards, jwb
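(For reference, one way to catch the 32-bit/64-bit mixup -- a sketch; the exact make variables depend on the benchmark's makefile:

  # confirm the kernel is running 64-bit
  isainfo -kv

  # gcc/g++ on Solaris emit 32-bit code by default; pass -m64 explicitly
  make CC="gcc -m64" CXX="g++ -m64"

  # verify the binary
  file ./bonnie++     # should report ELF 64-bit
)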
[zfs-discuss] ARC vs DULO
Hi all, has an alternative to the ARC been considered to improve sequential write IO in zfs? Here's a reference for DULO:
http://www.usenix.org/event/fast05/tech/full_papers/jiang/jiang_html/dulo-html.html#BG03

sd-
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Aug 30, 2007, at 12:33 PM, Jeffrey W. Baker wrote:
> This is my first use of ZFS, so be gentle. External == ZIL on a separate device, e.g. zpool create tank c2t0d0 log c2t1d0

Ok, cool! That's the way to do it. I'm always curious to see if people know about some of the new features in ZFS (and then there's the game of matching lingo: separate intent log == external journal). So the ZIL will be responsible for handling synchronous operations (O_DSYNC writes, file creates over NFS, fsync, etc). I actually don't see anything in the tests you ran that would stress this aspect (looks like randomio is doing 1% fsyncs). If you did, then you'd want to have more log devices (ie: a stripe of them).

> I blew away the hardware RAID, but here's the one for software: [zpool status output trimmed]

Ok, for the hardware RAID config, to do a fair comparison you'd just want to do a RAID-0 in ZFS, so something like:

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

We call this dynamic striping in ZFS.

> iostat shows balanced reads and writes across t[0-5], so I assume this is working.

Cool, makes sense.

> Do you think this would matter? There's no reason to believe that the RAID controller respects the flush commands, is there? As far as the operating system is concerned, the flush means that data is in non-volatile storage, and the RAID controller's cache/disk configuration is opaque.

From my experience dealing with some Hitachi and LSI devices, it makes a big difference (of course depending on the workload). ZFS needs to flush the cache for every transaction group (aka txg) and for ZIL operations. The txg happens about every 5 seconds. The ZIL operations are of course dependent on the workload: a workload that does lots of synchronous writes will trigger lots of ZIL operations, which will trigger lots of cache flushes.

For ZFS, we can safely enable the write cache on a disk - and part of that requires that we flush the write cache at specific times. However, syncing the non-volatile cache on a controller (with battery backup) doesn't make sense (and some devices will actually flush their cache), and can really hurt performance for workloads that flush a lot.

> Uh, whoops. As I freely admit this is my first encounter with opensolaris, I just built the software on the assumption that it would be 64-bit by default. But it looks like all my benchmarks were built 32-bit. Yow. I'd better redo them with -m64, eh?
> [time passes] Well, results are _substantially_ worse with bonnie++ recompiled at 64-bit. Way, way worse: 54MB/s linear reads, 23MB/s linear writes, 33MB/s mixed.

Hmm, what are your parameters?

> Randomio does aligned reads and writes. I'm not sure what you mean by /dev/sdb? The file upon which randomio operates is 4GiB.

Sorry, i was grabbing /dev/sdb from the http://arctic.org/~dean/randomio/ link (that was kinda silly). Ok cool, just making sure the file wasn't completely cacheable.

Another thing to know about ZFS is that it has a variable block size (that maxes out at 128KB). And since ZFS is COW, we can grow the block size on demand. For instance, if you just create a small file, say 1B, your block size is 512B. If you go over to 513B, we double you to 1KB, etc. Why it matters here (and you see this especially on databases) is that once a file has grown large, its blocks are the dataset's full recordsize, so a small random write means reading and rewriting a whole 128KB block; matching the recordsize to the application's I/O size before the data is written avoids that.
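(For reference, the recordsize has to be in place before the data files are written -- existing files keep the block size they were created with, as jwb's experiment below shows. A sketch, assuming a hypothetical dataset for a database with 8K pages:

  # set recordsize first, then load the data
  zfs create tank/pgdata
  zfs set recordsize=8k tank/pgdata
  # files created from now on use 8K blocks; pre-existing files keep theirs
)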
[zfs-discuss] Please help! ZFS crash burn in SXCE b70!
Hey folks, I've been wanting to use Solaris for a while now, for a ZFS home storage server and simply to get used to Solaris (I like to experiment). However, installing b70 has really not worked out for me at all.

The hardware I'm using is pretty simple, but didn't seem to be supported under the latest Nexenta or Belenix build. It seems to work fine in b70 SXCE... with a few catastrophic problems (counterintuitive, I know, but hear me out). I haven't yet managed to get the Webstart hardware analyzer to work on this system (no install; ubuntu liveCDs seem to not want to install a jdk for some reason), so I'm really not sure that the hardware is supported, but as I said, everything seemed to work fine in the installer and then initializing the ZFS pool (I'd just like to say how shockingly simple it was to create my zpool -- I was amazed!). The hardware is:

3.0ghz P4, socket 775
Intel 965G desktop board (Widowmaker)
3x 400GB SATA drives (ZFS RaidZ)
1x 100GB IDE drive (UFS boot)

I added a SI 2-port PCI SATA controller, but it seemed to not be recognized, so I am not using it.

The problems I'm experiencing are as follows: ZFS creates the storage pool just fine, sees no errors on the drives, and seems to work great... right up until I attempt to put data on the drives. After only a few moments of transfer, things start to go wrong. The system doesn't power off, it just beeps 4-5 times. The X session dies and the monitor turns off (doesn't drop back to a console). All network access dies. It seems that the system panics (is it called something else in solaris-land?). The HD access light stays on (though I can hear no drives doing anything strenuous), and the CD light blinks. This has happened two or three times, every time I've tried to start copying data to the ZFS pool. I've been transferring over the network, via SCP or NFS.

This happens every time I've attempted to transfer data to the ZFS storage pool. Data transfers to the UFS partition seemed to work fine, and when I rebooted, everything seemed to be working again. When I did a zfs scrub on the storage pool, the system crashed as usual, but didn't come back up properly. It went to a disk-cleanup root password prompt (which I couldn't enter because I didn't have USB legacy mode enabled, apparently USB isn't supported until the OS is fully booted, and I didn't have a spare PS2 keyboard to use on that system).

This is really bothersome, since I was really looking forward to the ease of use and administration of ZFS versus Linux software RAID + LVM. Can anybody shed some light on my situation? Is there any way I can get a little more information about what's causing this crash? I have no problem hooking up a serial console to the system to pull off info if that's possible (provided it has a serial port... I don't really remember). Or maybe there are logs stored when the system takes a dive? Anything I can do to help sort this out, I'll be willing to do.

As a side note, this so far is entirely experimental for me... I haven't even gotten the chance to get any large amount of data on the ZFS pool (~650MB so far), so I have no problem reinstalling, changing around hardware, or swapping the board and processor out for something different (I have several systems with some potential to be good storage servers that I don't mind moving around -- I borrowed 2 or 3 drives from work so that I can move data between stable systems to move around other hardware).

Thanks!
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Thu, 2007-08-30 at 13:07 -0700, eric kustarz wrote:
> Hmm, what are your parameters?

bonnie++ -g daemon -d /tank/bench/ -f

This becomes more interesting. The very slow numbers above were on an aged (post-benchmark) filesystem. After destroying and recreating the zpool, the numbers are similar to the originals (55/87/37). Does ZFS really age that quickly? I think I need to do more investigating here.

> Another thing to know about ZFS is that it has a variable block size (that maxes out at 128KB). [...]

# zfs set recordsize=2K tank/bench
# randomio bigfile 10 .25 .01 2048 60 1

  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  463.9 |  346.8   0.0   21.6  761.9  33.7  |  117.1   0.0   21.3  883.9  33.5

Roughly the same as when the RS was 128K. But if I set the RS to 2K before creating bigfile:

  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  614.7 |  460.4   0.0   18.5  249.3  14.2  |  154.4   0.0    9.6  989.0  27.6

Much better! Yay! So I assume you would always set RS=8K when using PostgreSQL, etc.?

-jwb
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
Jeffrey W. Baker wrote: [randomio results quoted above]

I presume these are something like Seagate DB35.3 series SATA 400 GByte drives? If so, then the spec'ed average read seek time is 11 ms and the rotational speed is 7,200 rpm, so the theoretical peak random read rate per drive is ~66 iops.
http://www.seagate.com/ww/v/index.jsp?vgnextoid=01117ea70fafd010VgnVCM10dd04090aRCRDlocale=en-US#

For an 8-disk mirrored set, the max theoretical random read rate is 527 iops. I see you're getting 460, so you're at 87% of theoretical. Not bad.

When writing, the max theoretical rate is a little smaller because of the longer seek time (see the datasheet), so we can get ~62 iops per disk. Also, the total is divided in half because we have to write to both sides of the mirror. Thus the peak is 248 iops. You see 154, or 62% of peak. Not quite so good.

But there is another behaviour here which is peculiar to ZFS. All writes are COW and allocated from free space, but this is done in 1 MByte chunks. For 2 kByte I/Os, that means you need to get to a very high rate before the workload is spread out across all of the disks simultaneously. You should be able to see this if you look at iostat with a small interval. With an 8 kByte recordsize, you should see that it is easier to spread the wealth across all 4 mirrored pairs. For other RAID systems, you can vary the stripe interlace, usually to much smaller values, to help spread the wealth. It is difficult to predict how this will affect your application performance, though.

For simultaneous reads and writes, 614 iops is pretty decent, but it makes me wonder if the spread is much smaller than the full disk. If the application only does 8 kByte iops, then I wouldn't even bother doing large, sequential workload testing... you'll never be able to approach that limit before you run out of some other resource, usually CPU or controller.

-- richard
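(For reference, the per-drive figure follows from the average service time -- a back-of-the-envelope using the numbers above:

  t(service) = t(seek) + half a rotation = 11 ms + 0.5 * (60 s / 7200) ~= 11 + 4.17 = 15.2 ms
  per-drive reads ~= 1 / 0.0152 ~= 66 iops; 8 drives * 66 ~= 527 iops for mirrored reads
)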
Re: [zfs-discuss] Please help! ZFS crash burn in SXCE b70!
> The problems I'm experiencing are as follows: ZFS creates the storage pool just fine, sees no errors on the drives, and seems to work great... right up until I attempt to put data on the drives. After only a few moments of transfer, things start to go wrong. [...] I've been transferring over the network, via SCP or NFS.

This could be a hardware problem. A bad power supply for the load? Try removing 2 of the large disks.
Re: [zfs-discuss] Please help! ZFS crash burn in SXCE b70!
Nigel Smith wrote:
> Are you sure your hardware is working without problems? I would first check the RAM with memtest86+ http://www.memtest.org/

Also, SunVTS should be in /usr/sunvts and includes memory and disk tests (plus others). This is the test suite we (Sun) use in manufacturing. Take care when using destructive tests :-)
-- richard
Re: [zfs-discuss] ZFS, XFS, and EXT4 compared
On Thu, 2007-08-30 at 15:28 -0700, Richard Elling wrote:
> I presume these are something like Seagate DB35.3 series SATA 400 GByte drives?

400GB 7200.10, which have slightly better seek specs.

> When writing, the max theoretical rate is a little smaller [...] Thus the peak is 248 iops. You see 154, or 62% of peak.

I think this line of reasoning is a bit misleading, since the reads and the writes are happening simultaneously, with a ratio of 3:1 in favor of reads, and 1% of all writes followed by an fsync. With all writes and no fsyncs, it's more like this:

  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  364.1 |    0.0   Inf   -NaN    0.0  -NaN  |  364.1   0.0   27.4 1795.8  69.3

Which is altogether respectable.

> For simultaneous reads and writes, 614 iops is pretty decent, but it makes me wonder if the spread is much smaller than the full disk.

Sure it is: 4GiB << 1.2TiB. If I spread it out over 128GiB, it's much slower, but it seems that would apply to any filesystem.

  190.8 |  143.4   0.0   53.4  254.4  26.6  |   47.4   3.6   49.4  558.8  29.4

-jwb
[zfs-discuss] Strange ZFS timeouts
I'm seeing some odd i/o behaviour on a Sun Fire running snv_70, connected via 4gb FC to some passthrough disks for a ZFS pool. The system is normally not heavily loaded, so I don't pay as much attention to I/O performance as I should, but recently we had several drives fail checksums (heat event), and so we've been putting ZFS through its paces on resilvers from spare drives.

However, zpool iostat is being somewhat confusing, as it is showing frequent, longish periods when no i/o is going on. A very similarly configured box on the same FC fabric (running snv_72 tagged as of Aug 21) does not exhibit the timeout behaviour. There are a few scsi timeouts in the logs, but not even remotely enough to account for the ZFS timeouts I'm seeing.

Really, I'm just looking for ideas on where to start debugging what might be causing the problem (which results in some really very silly resilver times). Config information and sample iostats follow.

Thanks, Jeff

-- Jeff Bachtel ([EMAIL PROTECTED], TAMU) http://www.cepheid.org/~jeff [finger [EMAIL PROTECTED] for PGP key]
"The sciences, each straining in its own direction, have hitherto harmed us little" - HPL, TCoC

(Good pool's zpool status. The error file is a dangling reference to a bad file that I've deleted. Member disks are 400gb.)

imsfs-mirror:~ sudo zpool status -xv
  pool: ims_pool_mirror
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress, 11.02% done, 15h2m to go
config:

        NAME                    STATE     READ WRITE CKSUM
        ims_pool_mirror         ONLINE       0     0     0
          raidz2                ONLINE       0     0     0
            c2t2104D9600099d14  ONLINE       0     0     0
            c2t2104D9600099d1   ONLINE       0     0     0
            c2t2104D9600099d2   ONLINE       0     0     0
            c2t2104D9600099d3   ONLINE       0     0     0
            c2t2104D9600099d15  ONLINE       0     0     0
            c2t2104D9600099d5   ONLINE       0     0     0
            c2t2104D9600099d6   ONLINE       0     0     0
            c2t2104D9600099d7   ONLINE       0     0     0
            c2t2104D9600099d8   ONLINE       0     0     0
            c2t2104D9600099d9   ONLINE       0     0     0
            c2t2104D9600099d10  ONLINE       0     0     0
            c2t2104D9600099d11  ONLINE       0     0     0
            c2t2104D9600099d12  ONLINE       0     0     0
            c2t2104D9600099d13  ONLINE       0     0     0
        spares
          c2t2104D9600099d0     AVAIL
          c2t2104D9600099d4     AVAIL

errors: Permanent errors have been detected in the following files:
        ims_pool_mirror/backup/vprweb:0xcad

(Good pool's zpool iostat 1 50)

imsfs-mirror:~ sudo zpool iostat 1 50
                    capacity     operations    bandwidth
pool              used  avail   read  write   read  write
---------------  -----  -----  -----  -----  -----  -----
ims_pool_mirror  3.84T  1.25T    574     21  65.7M   351K
ims_pool_mirror  3.84T  1.25T    458      0  55.3M      0
ims_pool_mirror  3.84T  1.25T    389      0  47.4M      0
ims_pool_mirror  3.84T  1.25T    532      0  64.2M      0
ims_pool_mirror  3.84T  1.25T    650      0  79.3M      0
ims_pool_mirror  3.84T  1.25T    391      0  47.6M      0
ims_pool_mirror  3.84T  1.25T    548      0  66.2M      0
ims_pool_mirror  3.84T  1.25T    462      0  56.1M      0
ims_pool_mirror  3.84T  1.25T    492      0  59.5M      0
ims_pool_mirror  3.84T  1.25T    488      0  59.6M      0
ims_pool_mirror  3.84T  1.25T    619      0  75.0M      0
ims_pool_mirror  3.84T  1.25T    430      0  52.2M      0
ims_pool_mirror  3.84T  1.25T    467      0  57.1M      0
ims_pool_mirror  3.84T  1.25T    463      0  56.3M      0
ims_pool_mirror  3.84T  1.25T    547      0  66.8M      0
ims_pool_mirror  3.84T  1.25T    513      0  62.2M      0
ims_pool_mirror  3.84T  1.25T    449      0  54.5M      0
ims_pool_mirror  3.84T  1.25T    445      0  53.6M      0
ims_pool_mirror  3.84T  1.25T    501      0  61.4M      0
ims_pool_mirror  3.84T  1.25T    558      0  68.1M      0
ims_pool_mirror  3.84T  1.25T    718      0  87.5M      0
ims_pool_mirror  3.84T  1.25T    385      0  47.0M      0
ims_pool_mirror  3.84T  1.25T    415      0  50.2M      0
ims_pool_mirror  3.84T  1.25T    626      0  76.1M      0
ims_pool_mirror  3.84T  1.25T    579      0  70.6M      0
ims_pool_mirror  3.84T  1.25T    516      0  62.9M      0
ims_pool_mirror  3.84T  1.25T    465      0  56.5M      0
ims_pool_mirror  3.84T  1.25T    601      0  73.2M      0
ims_pool_mirror  3.84T  1.25T    361      0  44.5M      0
ims_pool_mirror  3.84T  1.25T    335      0  40.0M      0
ims_pool_mirror  3.84T  1.25T
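(For the debugging question above, two generic starting points -- a sketch, not specific to this box:

  # correlate the zpool iostat gaps with per-device activity
  # (1-second samples, idle devices suppressed)
  iostat -xnz 1

  # check the FMA error log for SCSI retries/timeouts that never reached syslog
  fmdump -eV | grep -i timeout
)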