Re: [zfs-discuss] partitioned cache devices
On Mar 16, 2013, at 7:01 PM, Andrew Werchowiecki <andrew.werchowie...@xpanse.com.au> wrote:

> It's a home setup; the performance penalty from splitting the cache devices is nonexistent, and that workaround sounds like a pretty crazy amount of overhead where I could instead just have a mirrored slog. I'm less concerned about wasted space, more concerned about the number of SAS ports I have available.
>
> I understand that p0 refers to the whole disk... in the logs I pasted I'm not attempting to mount p0. I'm trying to work out why I'm getting an error attempting to mount p2 after p1 has successfully mounted. Further, this has been done before on other systems in the same hardware configuration in exactly the same fashion, and I've gone over the steps trying to make sure I haven't missed something, but I can't see a fault.

You can have only one Solaris partition at a time. Ian already shared the answer: create one 100% Solaris partition and then use format to create two slices.
 -- richard

> I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.
>
> From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [opensolarisisdeadlongliveopensola...@nedharvey.com]
> Sent: Friday, 15 March 2013 8:44 PM
> To: Andrew Werchowiecki; zfs-discuss@opensolaris.org
> Subject: RE: partitioned cache devices
>
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Andrew Werchowiecki
>>
>> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
>> Password:
>> cannot open '/dev/dsk/c25t10d1p2': I/O error
>> muslimwookie@Pyzee:~$
>>
>> I have two SSDs in the system. I've created an 8 GB partition on each drive for use as a mirrored write cache. I also have the remainder of each drive partitioned for use as read cache. However, when attempting to add it I get the error above.
>
> Sounds like you're probably running into confusion about how to partition the drive. If you create fdisk partitions, they will be accessible as p0, p1, p2, but I think p0 unconditionally refers to the whole drive, so the first partition is p1 and the second is p2. If you create one big Solaris fdisk partition and then slice it via format's partition command (where s2 is typically the encompassing slice, and people usually use s0, s1, and s6 for actual slices), then they will be accessible via s0, s1, s6.
>
> Generally speaking, it's inadvisable to split the slog/cache devices anyway. Because: if you're splitting them, evidently you're focusing on the wasted space -- buying an expensive 128G device where you couldn't possibly ever use more than 4G or 8G in the slog. But that's not what you should be focusing on. You should be focusing on the speed (that's why you bought it in the first place). The slog is write-only, and the cache is a mixture of read/write, where it should hopefully be doing more reads than writes. But regardless of your actual success with the cache device, your cache device will be busy most of the time, and competing against the slog.
>
> You have a mirror, you say. You should probably drop the split: use one whole device for the cache, use one whole device for the log. The only risk you'll run is this: since a slog is write-only (except during mount, typically at boot), it's possible to have a failure mode where you think you're writing to the log, but the first time you go back and read, you discover an error, and discover the device has gone bad. In other words, without ever doing any reads, you might not notice when/if the device goes bad.
> Fortunately, there's an easy workaround. You could periodically (say, once a month) script the removal of your log device, create a junk pool, write a bunch of data to it, scrub it (thus verifying it was written correctly), and in the absence of any scrub errors, destroy the junk pool and re-add the device as a slog to the main pool. I've never heard of anyone actually being that paranoid, and I've never heard of anyone actually experiencing the aforementioned possible undetected device failure mode. So this is all mostly theoretical. Mirroring the slog device really isn't necessary in the modern age.
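A minimal sketch of the single-partition-plus-slices layout Ian and Richard describe, assuming two SSDs named c25t10d1 and c25t11d1 (device names are placeholders; format's partition menu is interactive):

    # fdisk -B /dev/rdsk/c25t10d1p0      # one Solaris partition spanning the disk
    # format c25t10d1                    # partition> create s0 (~8 GB) and s1 (rest)
    # zpool add aggr0 log mirror c25t10d1s0 c25t11d1s0
    # zpool add aggr0 cache c25t10d1s1 c25t11d1s1

And a sketch of Edward's periodic slog self-test, assuming an unmirrored slog c25t10d1s0 on pool aggr0:

    #!/bin/sh
    zpool remove aggr0 c25t10d1s0        # detach the slog for testing
    zpool create junk c25t10d1s0
    dd if=/dev/urandom of=/junk/testfile bs=1M count=2048
    zpool scrub junk
    while zpool status junk | grep -q 'scrub in progress'; do sleep 10; done
    zpool status -x junk                 # check for errors before trusting the device
    zpool destroy junk
    zpool add aggr0 log c25t10d1s0       # return the device to slog duty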
Re: [zfs-discuss] Petabyte pool?
On Mar 15, 2013, at 6:09 PM, Marion Hakanson <hakan...@ohsu.edu> wrote:

> Greetings,
> Has anyone out there built a 1-petabyte pool?

Yes, I've done quite a few.

> I've been asked to look into this, and was told low performance is fine, workload is likely to be write-once, read-occasionally, archive storage of gene-sequencing data. Probably a single 10Gbit NIC for connectivity is sufficient.
>
> We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis, using 4TB nearline SAS drives, giving over 100TB usable space (raidz3). Back-of-the-envelope might suggest stacking up eight to ten of those, depending on whether you want a raw marketing petabyte or a proper power-of-two usable petabyte.

Yes. NB, for the PHB, using N^2 is found 2B less effective than N^10.

> I get a little nervous at the thought of hooking all that up to a single server, and am a little vague on how much RAM would be advisable, other than as much as will fit (:-). Then again, I've been waiting for something like pNFS/NFSv4.1 to be usable for gluing together multiple NFS servers into a single global namespace, without any sign of that happening anytime soon.

NFSv4 or DFS (or even clever sysadmin + automount) offers a single namespace without needing the complexity of NFSv4.1, Lustre, GlusterFS, etc.

> So, has anyone done this? Or come close to it? Thoughts, even if you haven't done it yourself?

Don't forget about backups :-)
 -- richard
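Rough arithmetic behind "eight to ten," assuming one plausible 45-slot layout (4x 11-disk raidz3 plus a spare; these numbers are illustrative, not Marion's exact config):

    usable/chassis ~= 4 x (11 - 3) x 4 TB = 128 TB
    marketing petabyte:    1000 TB / 128 TB          ~= 8 chassis
    power-of-two petabyte: 1024 TiB / (32 x 3.64 TiB) ~= 9-10 chassis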
Re: [zfs-discuss] ZFS Distro Advice
On Feb 26, 2013, at 12:33 AM, Tiernan OToole <lsmart...@gmail.com> wrote:

> Thanks all! I will check out FreeNAS and see what it can do... I will also check my RAID card and see if it can work with JBOD... fingers crossed... The machine has a couple of internal SATA ports (I think there are 2, could be 4), so I was thinking of using those for boot disks and SSDs later...
>
> As a follow-up question on data deduplication: the machine, to start, will have about 5GB RAM. I read somewhere that 20TB storage would require about 8GB RAM, depending on block size... and I don't know block sizes yet (I store a mix of VMs, TV shows, movies and backups on the NAS).

Consider using different policies for different data. For traditional file systems, you had relatively few policy options: readonly, nosuid, quota, etc. With ZFS, dedup and compression are also policy options. In your case, dedup for your media is not likely to be a good policy, but dedup for your backups could be a win (unless you're using something that already doesn't back up duplicate data -- e.g., most backup utilities). A way to approach this is to think of your directory structure and create file systems to match the policies. For example (see the sketch after this message):

  /home/richard = compressed (default top level, since properties are inherited)
  /home/richard/media = compressed
  /home/richard/backup = compressed + dedup
 -- richard

> I am not sure how much memory I will need (my estimate is 10TB raw (8TB usable?) in a RAIDZ1 pool, and then 3TB raw in a striped pool). If I don't have enough memory now, can I enable dedup at a later stage when I add memory? Also, if I pick FreeBSD now and want to move to, say, Nexenta, is that possible? Assuming the drives are just JBOD drives (to be confirmed), could they just get imported? Thanks.
>
> On Mon, Feb 25, 2013 at 6:11 PM, Tim Cook <t...@cook.ms> wrote:
>> On Mon, Feb 25, 2013 at 7:57 AM, Volker A. Brandt <v...@bb-c.de> wrote:
>>> Tim Cook writes:
>>>> I need something that will allow me to share files over SMB (3 if possible), NFS, AFP (for Time Machine) and iSCSI. Ideally, I would like something I can manage easily and something that works with the Dell...
>>> All of them should provide the basic functionality you're looking for. None of them will provide SMB3 (at all) or AFP (without a third-party package). FreeNAS has AFP built in, including a Time Machine discovery method. The latest FreeNAS is still based on Samba 3.x, but they are aware of 4.x and will probably integrate it at some point in the future. Then you should have SMB3. I don't know how far along they are...
>>> Best regards -- Volker
>> FreeNAS comes with a package pre-installed to add AFP support. There is no native AFP support in FreeBSD and, by association, FreeNAS.
>> --Tim
>
> --
> Tiernan O'Toole
> blog.lotas-smartman.net
> www.geekphotographer.com
> www.tiernanotoole.ie
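A minimal sketch of that per-dataset policy layout, assuming a pool named tank (dataset names are illustrative):

    # zfs create -p -o compression=on tank/home/richard   # children inherit compression
    # zfs create tank/home/richard/media                  # inherits compression=on
    # zfs create -o dedup=on tank/home/richard/backup     # compressed + dedup
    # zfs get -r compression,dedup tank/home/richard      # verify the inheritance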
Re: [zfs-discuss] cannot destroy, volume is busy
On Feb 21, 2013, at 8:02 AM, John D Groenveld <jdg...@elvis.arl.psu.edu> wrote:

> # zfs list -t vol
> NAME          USED  AVAIL  REFER  MOUNTPOINT
> rpool/dump   4.00G  99.9G  4.00G  -
> rpool/foo128 66.2M   100G    16K  -
> rpool/swap   4.00G  99.9G  4.00G  -
> # zfs destroy rpool/foo128
> cannot destroy 'rpool/foo128': volume is busy
>
> I checked that the volume is not a dump or swap device and that iSCSI is disabled.

The iSCSI service is not STMF. STMF will need to be disabled, or the volume no longer used by STMF.
  iSCSI service is svc:/network/iscsi/target:default
  STMF service is svc:/system/stmf:default

> On Solaris 11.1, how would I determine what's busying it?

One would think that fuser would work, but in my experience, fuser rarely does what I expect. If you suspect STMF, then try stmfadm list-lu -v
 -- richard
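A sketch of that check, with a hypothetical LU GUID standing in for whatever stmfadm actually reports for rpool/foo128:

    # svcs stmf                       # is STMF online?
    # stmfadm list-lu -v              # look for a LU whose Data File is the zvol
    # stmfadm delete-lu 600144F0...   # hypothetical GUID; removes the STMF binding
    # zfs destroy rpool/foo128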
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Feb 20, 2013, at 2:49 PM, Markus Grundmann <mar...@freebsduser.eu> wrote:

> Hi! My name is Markus and I live in Germany. I'm new to this list and I have a simple question related to ZFS. My favorite operating system is FreeBSD and I'm very happy to use ZFS on it. Would it be possible to enhance the properties in the current source tree with an entry like "protected"? It seems not to be difficult, but I'm not a professional C programmer. For more information please take a little bit of time and read my short post at http://forums.freebsd.org/showthread.php?t=37895
>
> I have reviewed some pieces of the source code in FreeBSD 9.1 to find out how difficult it would be to add a pool/filesystem property as an additional security layer for administrators. Whenever I modify ZFS pools or filesystems it's possible to destroy [on a bad day :-)] my data. A new property "protected=on|off" on the pool and/or filesystem could help the administrator avoid data loss (e.g., a "zpool destroy tank" or "zfs destroy tank/filesystem" command would be rejected when the protected=on property is set).

Look at the delegable properties (zfs allow). For example, you can delegate a user to have specific privileges and then not allow them to destroy (see the sketch after this message). Note: I'm only 99% sure this is implemented in FreeBSD, hopefully someone can verify.
 -- richard

> Is there anywhere on this list where we can discuss/forward this feature request? I hope you understand my post ;-)
> Thanks and best regards, Markus
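A sketch of that delegation, assuming user markus and pool tank; zfs allow grants only what you list (there is no deny), so destroy is simply left out:

    # zfs allow markus create,mount,snapshot,rollback,send,receive tank
    # zfs allow tank                  # display delegations; destroy is absent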
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Feb 20, 2013, at 3:27 PM, Tim Cook <t...@cook.ms> wrote:

> On Wed, Feb 20, 2013 at 5:09 PM, Richard Elling <richard.ell...@gmail.com> wrote:
>> [Markus's "protected" property request and Richard's zfs allow suggestion, quoted above.]
>
> With the version of allow I'm looking at, unless I'm missing a setting, it looks like it'd be a complete nightmare. I see no concept of deny, so that means you either have to give *everyone* all permissions besides delete, or you have to go through every user/group on the box and give specific permissions on top of not allowing destroy. And then if you change your mind later, you have to go back through and give everyone you want to have that feature access to it. That seems like a complete PITA to me.

:-) they don't call it idiot-proofing for nothing! :-)

But seriously, one of the first great zfs-discuss wars was over the request for a -f flag for destroy. The result of the research showed that if you typed "destroy" then you meant it, and adding a -f flag just teaches you to type "destroy -f" instead. See also "kill -9".
 -- richard
Re: [zfs-discuss] zfs-discuss mailing list opensolaris EOL
On Feb 16, 2013, at 10:16 PM, Bryan Horstmann-Allen <b...@mirrorshades.net> wrote:

> On 2013-02-17 18:40:47, Ian Collins wrote:
>> One of its main advantages is that it has been platform-agnostic. We see Solaris, Illumos, BSD and more recently ZFS on Linux questions all given the same respect. I do hope we can get another, platform-agnostic, home for this list.
>
> As the guy who provides the illumos mailing list services, and as someone who has deeply vested interests in seeing ZFS thrive on all platforms, I'm happy to suggest that we'd welcome all comers on z...@lists.illumos.org.

+1
 -- richard
Re: [zfs-discuss] how to know available disk space
On Feb 6, 2013, at 5:17 PM, Gregg Wonderly <gregg...@gmail.com> wrote:

> This is one of the greatest annoyances of ZFS. I don't really understand why a zvol's space cannot be accurately enumerated from top to bottom of the tree in 'df' output, etc. Why does a zvol divorce the space used from the root of the volume?

Thick (with reservation) and thin provisioning behave differently. Also, depending on how you created the reservation, it might or might not account for the metadata overhead needed. By default, space for metadata is reserved, but if you use the -s (sparse, aka thin provisioning) option, then later reservation changes are set as absolute. Also, metadata space, compression, copies, and deduplication must be accounted for. The old notions of free/available don't match very well with these modern features.

> Gregg Wonderly
>
> On Feb 6, 2013, at 5:26 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
>> I have a bunch of VMs, and some samba shares, etc., on a pool. I created the VMs using zvols, specifically so they would have an appropriate refreservation and never run out of disk space, even with snapshots. Today, I ran out of disk space, and all the VMs died. So obviously it didn't work. When I used zpool list after the system crashed, I saw this:
>>
>> NAME     SIZE  ALLOC  FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
>> storage  928G   568G  360G         -  61%  1.00x  ONLINE  -
>>
>> I did some cleanup so I could turn things back on... freed up about 4G. Now, when I use zpool list I see this:
>>
>> NAME     SIZE  ALLOC  FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
>> storage  928G   564G  364G         -  60%  1.00x  ONLINE  -
>>
>> When I use zfs list storage I see this:
>>
>> NAME     USED  AVAIL  REFER  MOUNTPOINT
>> storage  909G  4.01G  32.5K  /storage
>>
>> So I guess the lesson is (a) refreservation and zvol alone aren't enough to ensure your VMs will stay up, and (b) if you want to know how much room is *actually* available -- as in usable, as in "how much can I write before I run out of space" -- you should use zfs list and not zpool list.

Correct. zpool list does not show dataset space available.
 -- richard
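For figuring out where the pool space actually went, per-dataset accounting is more useful than pool-wide counters; a sketch, with storage/vm1 as a hypothetical zvol:

    # zfs list -o space storage       # splits USED into snapshots/refreservation/children
    # zfs get refreservation,volsize,usedbyrefreservation storage/vm1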
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Jan 29, 2013, at 6:08 AM, Robert Milkowski <rmilkow...@task.gda.pl> wrote:

>> From: Richard Elling
>> Sent: 21 January 2013 03:51
>> VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.
>
> That is one thing that has always bothered me... so it is OK for others, like Nexenta, to keep stuff closed and not in the open, while if Oracle does it they are bad?

Nexenta is just as bad. For the record, the illumos-community folks who worked at Nexenta at the time were overruled by executive management. Some of those folks are now executive management elsewhere :-)

> Isn't that at least a little bit hypocritical? (bashing Oracle and doing sort of the same)

No, not at all.
 -- richard
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Jan 20, 2013, at 8:16 AM, Edward Harvey <imaginat...@nedharvey.com> wrote:

> But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

I disagree that ZFS is developmentally challenged. There is more development now than ever, in every way: number of developers, companies, OSes, KLOCs, features. Perhaps the level of maturity makes progress appear to be moving slower than it did early in life?
 -- richard
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Jan 20, 2013, at 4:51 PM, Tim Cook <t...@cook.ms> wrote:

> On Sun, Jan 20, 2013 at 6:19 PM, Richard Elling <richard.ell...@gmail.com> wrote:
>> [Richard's "I disagree that ZFS is developmentally challenged" reply, quoted above.]
>
> Well, perhaps a part of it is marketing.

A lot of it is marketing :-/

> Maturity isn't really an excuse for not having a long-term feature roadmap. It seems as though maturity in this case equals stagnation. What are the features being worked on that we aren't aware of?

Most of the illumos-centric discussion is on the developer's list. The ZFSonLinux and BSD communities are also quite active. Almost none of the ZFS developers hang out on zfs-discuss@opensolaris.org anymore. In fact, I wonder why I'm still here...

> The big ones that come to mind that everyone else is talking about, for not just ZFS but OpenIndiana as a whole and other storage platforms, would be:
>
> 1. SMB3 -- Hyper-V WILL be gaining market share over the next couple of years; not supporting it means giving up a sizeable portion of the market. Not to mention finally being able to run SQL (again) and Exchange on a file share.

I know of at least one illumos community company working on this. However, I do not know their public plans.

> 2. VAAI support.

VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.

> 3. The long-sought bp_rewrite.

Go for it!

> 4. Full-drive encryption support.

This is a key management issue mostly. Unfortunately, the open source code for handling this (trousers) covers much more than keyed disks and can be unwieldy. I'm not sure which distros picked up trousers, but it doesn't belong in illumos-gate and it doesn't expose itself to ZFS.

> 5. Tiering (although I'd argue caching is superior, it's still a checkbox).

You want to add tiering to the OS? That has been available for a long time via the (defunct?) SAM-QFS project that actually delivered code: http://hub.opensolaris.org/bin/view/Project+samqfs/ If you want to add it to ZFS, that is a different conversation.
 -- richard

> There's obviously more, but those are just ones off the top of my head that others are supporting/working on. Again, it just feels like all the work is going into fixing bugs and refining what is there, not adding new features. Obviously Saso personally added features, but overall there don't seem to be a ton of announcements to the list about features that have been added or are being actively worked on. It feels like all these companies are just adding niche functionality they need that may or may not be getting pushed back to mainline.
> /debbie-downer
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement.
>
> Oh, I forgot to mention -- the above logic only makes sense for mirrors and stripes, not for raidz (or raid-5/6/dp in general). If you have a pool of mirrors or stripes, the system isn't forced to subdivide a 4k block onto multiple disks, so it works very well. But if you have a pool blocksize of 4k and, let's say, a 5-disk raidz (capacity of 4 disks), then the 4k block gets divided into 1k on each data disk and 1k of parity on the parity disk. Now, since the hardware only supports block sizes of 4k... you can see there's a lot of wasted space, and if you do a bunch of it, you'll also have a lot of wasted time waiting for seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised of 4k-sector disks, then there will be one data block and one parity block. There will not be 4 data + 1 parity with 75% space wastage. Rather, the space allocation more closely resembles a variant of mirroring, like some vendors call RAID-1E.
 -- richard
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
bloom filters are a great fit for this :-)
 -- richard

On Jan 19, 2013, at 5:59 PM, Nico Williams <n...@cryptonector.com> wrote:

> I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes. To decide if a block needs dedup, one would first check the Bloom filter; then, if the block is in it, use the dedup code path, else use the non-dedup code path and insert the block in the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT. This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost. The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT.
>
> It's very likely that this is a bit too obvious to just work. Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyway.
>
> Nico
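For readers who haven't met the data structure Nico proposes, a minimal Bloom filter sketch keyed by block hashes -- illustrative only, not ZFS code; the sizes and names are made up:

    import hashlib

    class BloomFilter:
        # probabilistic set: no false negatives, tunable false-positive rate
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, key):
            # derive k bit positions from the block hash
            for i in range(self.k):
                h = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    # write path as Nico describes: only blocks seen before take the dedup path
    bf = BloomFilter()
    def write_block(block_hash):
        if bf.might_contain(block_hash):   # probably a duplicate
            return "dedup path"
        bf.add(block_hash)                 # first sighting: cheap non-dedup path
        return "non-dedup path"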
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 9:35 PM, Thomas Nau <thomas@uni-ulm.de> wrote:

> Thanks for all the answers (more inline)
>
> On 01/18/2013 02:42 AM, Richard Elling wrote:
>> On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
>>> On Wed, 16 Jan 2013, Thomas Nau wrote:
>>>> Dear all, I've a question concerning possible performance tuning for both iSCSI access and replicating a ZVOL through zfs send/receive. We export ZVOLs with the default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI. The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL SSDs and 128G of main memory. The iSCSI access pattern (1-hour daytime average) looks like the following (thanks to Richard Elling for the dtrace script):
>>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement.
>> 4k might be a little small. 8k will have less metadata overhead. In some cases we've seen good performance on these workloads up through 32k. Real pain is felt at 128k :-)
>
> My only pain so far is the time a send/receive takes without really loading the network at all. VM performance is nothing I worry about at all as it's pretty good. So the key question for me is whether going from 8k to 16k or even 32k would have some benefit for that problem?

send/receive can bottleneck on the receiving side. Take a look at the archives, searching for "mbuffer", as a method of buffering on the receive side. In a well-tuned system, the send will be from ARC :-)
 -- richard
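A common mbuffer arrangement from those archives, sketched with hypothetical host and dataset names; the -s (chunk size) and -m (buffer memory) values are just starting points:

    remote# mbuffer -I 9090 -s 128k -m 1G | zfs receive -F tank/backup
    local#  zfs send tank/vol@snap | mbuffer -O remote:9090 -s 128k -m 1G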
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 18, 2013, at 4:40 AM, Jim Klimov <jimkli...@cos.ru> wrote:

> On 2013-01-18 06:35, Thomas Nau wrote:
>> [volblocksize question, quoted in the previous message.]
>
> I would guess that increasing the block size would, on one hand, improve your reads -- due to more userdata being stored contiguously as part of one ZFS block -- and thus sending of the backup streams should be more about reading and sending the data and less about random seeking.

There is too much caching in the datapath to make a broad statement stick. Empirical measurements with your workload will need to choose the winner.

> On the other hand, this may likely be paid for with the need to do more read-modify-writes (when larger ZFS blocks are partially updated with the smaller clusters in the VM's filesystem) while the overall system is running and used for its primary purpose. However, since the guest FS is likely to store files of non-minimal size, it is likely that the whole larger backend block would be updated anyway...

For many ZFS implementations, RMW for zvols is the norm.

> So, I think, this is something an experiment can show you -- whether the gain during backup (and primary-job) reads vs. possible degradation during the primary-job writes would be worth it. As for the experiment, I guess you can always make a ZVOL with a different block size, dd data into it from the production dataset's snapshot, and attach the VM or its clone to the newly created clone of its disk image.

In my experience, it is very hard to recreate in the lab the environments found in real life. dd, in particular, will skew the results a bit because it is in LBA order for zvols, not the creation order as seen in the real world. That said, trying to get high performance out of HDDs is an exercise like fighting the tides :-)
 -- richard
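A sketch of Jim's experiment, with hypothetical names; on illumos a zvol snapshot is exposed read-only under /dev/zvol/rdsk, so it can be copied directly:

    # zfs snapshot tank/vm-disk@blocktest
    # zfs create -V 100G -o volblocksize=32k tank/vm-disk-32k
    # dd if=/dev/zvol/rdsk/tank/vm-disk@blocktest of=/dev/zvol/rdsk/tank/vm-disk-32k bs=1M

Then attach the copy to a cloned VM and compare iostat and send/receive times against the 8k original.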
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:

> On Wed, 16 Jan 2013, Thomas Nau wrote:
>> Dear all, I've a question concerning possible performance tuning for both iSCSI access and replicating a ZVOL through zfs send/receive. We export ZVOLs with the default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI. The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL SSDs and 128G of main memory. The iSCSI access pattern (1-hour daytime average) looks like the following (thanks to Richard Elling for the dtrace script):
>
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases we've seen good performance on these workloads up through 32k. Real pain is felt at 128k :-)

[ stuff removed ]

>> For disaster recovery we plan to sync the pool as often as possible to a remote location. Running send/receive after a day or so seems to take a significant amount of time wading through all the blocks, and we hardly see network average traffic going over 45MB/s (almost idle 1G link). So here's the question: would increasing/decreasing the volblocksize improve the send/receive operation, and what influence might it have on the iSCSI side?
>
> Matching the volume block size to what the clients are actually using (due to their filesystem configuration) should improve performance during normal operations, and should reduce the number of blocks which need to be sent in the backup by reducing write amplification due to overlap blocks.

Compression is a good win, too.
 -- richard
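The compression suggestion is a one-liner; lz4 assumes a pool with the lz4_compress feature enabled (new in illumos at the time), otherwise compression=on selects lzjb:

    # zfs set compression=lz4 tank/vm-disk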
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 8:35 AM, Jim Klimov <jimkli...@cos.ru> wrote:

> On 2013-01-17 16:04, Bob Friesenhahn wrote:
>> [volblocksize suggestion, quoted in the previous message.]
>
> Also, it would make sense while you are at it to verify that the clients (i.e., the VMs' filesystems) do their I/Os 4KB-aligned -- that is, that their partitions start at a 512b-based sector offset divisible by 8 inside the virtual HDDs, and the FS headers also align to that, so the first cluster is 4KB-aligned.

This is the classical expectation. So I added an alignment check into nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from ESX and, believe it or not, alignment doesn't matter at all, at least for the data I've collected. I'll let NetApp wallow in the mire of misalignment while I blissfully dream of other things :-)

> Classic MSDOS MBR did not guarantee that partition start, since it used 63 sectors as the cylinder size and offset factor. Newer OSes don't use the classic layout, as any config is allowable; and GPT is well aligned as well.
>
> Overall, a single I/O in the VM guest changing a 4KB cluster in its FS should translate to one 4KB I/O in your backend storage changing the dataset's userdata (without reading a bigger block and modifying it with COW), plus some avalanche of metadata updates (likely with COW) for ZFS's own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload on the wire. To wit, I did a bunch of VDI and VDI-like (small, random writes) testing on XenServer, and while the clients were chugging away doing 4K random I/Os, on the wire I was seeing 1MB NFS writes. In part this analysis led to my cars-and-trains analogy. In some VMware configurations, over the wire you could see a 16k read for every 4k random write. Go figure. Fortunately, those 16k reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of what is really happening over the wire.
 -- richard
Re: [zfs-discuss] HP Proliant DL360 G7
On Jan 8, 2013, at 10:30 AM, Edmund White <ewwh...@mac.com> wrote:

> The D2600 and D2700 enclosures are fully supported as Nexenta JBODs. I run them in multiple production environments.

Yes, I worked on the field qualifications for these... very nice JBODs :-)

> I *could* use an HP-branded LSI controller (SC08Ge), but I prefer the higher performance of the LSI 9211 and 9205e HBAs.

Many of the big-box vendors have to deal with Windows as the target OS. Until Server 2012, the use of JBODs with lots of disks was challenging for Windows. Hence, they offer few options for the folks who want JBOD control.
 -- richard

> I recently posted on Server Fault with the Nexenta console representation of the HP D2700 JBOD. It's already integrated with NexentaStor.
> --
> Edmund White
> ewwh...@mac.com
>
> From: Mark <carne...@gmail.com>
> Date: Tuesday, January 8, 2013 12:09 PM
> To: Sašo Kiselkov <skiselkov...@gmail.com>
> Cc: zfs-discuss@opensolaris.org
> Subject: Re: [zfs-discuss] HP Proliant DL360 G7
>
>> Good call, Saso. Sigh... I guess I'll wait to hear from HP on supported IT-mode HBAs in their D2000s or other JBODs.
>>
>> On Tue, Jan 8, 2013 at 11:40 AM, Sašo Kiselkov <skiselkov...@gmail.com> wrote:
>>> On 01/08/2013 04:27 PM, mark wrote:
>>>> On Jul 2, 2012, at 7:57 PM, Richard Elling wrote:
>>>>> FYI, HP also sells an 8-port IT-style HBA (SC-08Ge), but it is hard to locate with their configurators. There might be a more modern equivalent cleverly hidden somewhere difficult to find. -- richard
>>>> Richard, do you know if the HBAs in HP controllers can be swapped out with any well-characterized (by Nexenta) HBAs like the 9211-8e, or do they require a specific 'controller HBA' like the SC-08Ge? I.e., does it void the warranty if you open up the controller and stick a third-party card in there? Did you ever try to 'bypass' the controllers at all and just plug into an expander? I prefer HP hardware also, but the controller is getting in the way. I'll be asking HP the same questions in the next few weeks with any luck, but your opinion and experiences are on another level compared to HP's pre-sales department... not that they're bad, but in this realm you're the man :)
>>> I know you didn't ask me, but I can tell you my experience: it depends on what you mean by warranty. If you mean warranty on sales of goods (as mandated by law), then no, sticking a different HBA in your servers does not void your warranty (unless this is expressly labeled on the product -- manufacturers typically also put protective labels on screws then). When it comes to support services, though, such as phone support and firmware updates, then yes, using a third-party HBA can make these difficult and/or impossible. HP storage enclosure and drive firmware, for example, can only be flashed through an HP-branded SmartArray card. Depending on what software you are running on the machines, it can make no difference at all, or a lot of difference. For instance, if you're running proprietary storage controller software on the server (think something like NexentaStor, but from the HW vendor), then your custom HBA might simply be flat-out unsupported, and the only response you'll get from the vendor support team is "stick the card we shipped it with back in." OTOH, if you're running something not HW-vendor-specific (like the aforementioned NexentaStor, or any other Illumos variant), and the HW vendor at least gives lip service to supporting your platform (always tell the support folk you're running Solaris), then chances are that your support contract will be just as valid as before.
>>> I've had drives fail on Dell machines, and each time support was happy when I just told them "drive dead, running Solaris, here's the log output, send a new one please."
>>> Cheers, -- Saso
Re: [zfs-discuss] mpt_sas multipath problem?
On Jan 7, 2013, at 1:20 PM, Marion Hakanson <hakan...@ohsu.edu> wrote:

> Greetings,
> We're trying out a new JBOD here. Multipath (mpxio) is not working, and we could use some feedback and/or troubleshooting advice.

Sometimes the mpxio detection doesn't work properly. You can try to whitelist them (see the sketch after this message): https://www.illumos.org/issues/644
 -- richard

> The OS is oi151a7, running on an existing server with a 54TB pool of internal drives. I believe the server hardware is not relevant to the JBOD issue, although the internal drives do appear to the OS with multipath device names (despite the fact that these internal drives are cabled up in a single-path configuration). If anything, this does confirm that multipath is enabled in mpt_sas.conf via the mpxio-disable="no" directive (internal HBAs are LSI SAS, 2x 9201-16i and 1x 9211-8i).
>
> The JBOD is a SuperMicro 847E26-RJBOD1, with the front backplane daisy-chained to the rear backplane (both expanders). Each of the two expander chains is connected to one port of an LSI SAS 9200-8e HBA. So far, all this hardware has appeared as working for others and well-supported, and this 9200-8e is running the IT firmware, version 15.0.0.0. The drives are 40x of the WD4001FYYG SAS 4TB variety, firmware VR02.
>
> The spot-checks I've done so far seem to show that both device instances of a drive show up in "prtconf -Dv" with identical serial numbers and identical devid and guid values, so I'm not sure what might be missing to allow mpxio to recognize them as the same device.
>
> Has anyone out there got this type of hardware working? In a multipath configuration? Suggestions on mdb or dtrace code I can use to debug? Are there secrets to the internal daisy-chain cabling that our vendor is not aware of?
>
> Thanks and regards,
> Marion
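The whitelist from that illumos issue goes into scsi_vhci.conf (/etc/driver/drv/scsi_vhci.conf on illumos/OI). A sketch for these drives -- the vendor/product inquiry strings must match exactly, including padding, so verify them with format or prtconf before copying:

    scsi-vhci-failover-override =
        "WD      WD4001FYYG", "f_sym";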
Re: [zfs-discuss] pool layout vs resilver times
On Jan 5, 2013, at 9:42 AM, Russ Poyner <rpoy...@engr.wisc.edu> wrote:

> I'm configuring a box with 24x 3TB consumer SATA drives, and wondering about the best way to configure the pool. The customer wants capacity on the cheap, and I want something I can service without sweating too much about data loss. Due to capacity concerns RAID-10 is probably out, which leaves various raidz choices.

You should consider space, data dependability as measured by Mean Time to Data Loss (MTTDL), and performance. For the MTTDL[1] model, let's use 700k hours MTBF for the disks and 168 hours for recovery (48 hours logistical + 120 hours resilver of a full disk). For performance, let's hope for 7,200 rpm and about 80 IOPS per disk for small, random reads with 100% cache miss.

> 1. A stripe of four 6-disk raidz2

Option 1:
  space ~= 4 * (6 - 2) * 3TB = 48 TB
  MTTDL[1] = 8.38e+5 years, or 0.000119% Annualized Failure Rate (AFR)
  small, random read performance, best of worst case = 4 * (6/4) * 80 IOPS = 480 IOPS

> 2. A stripe of two 11-disk raidz3 with 2 hot spares.

Option 2:
  space ~= 2 * (11 - 3) * 3TB = 48 TB
  MTTDL[1] = 3.62e+7 years, or 0.03% AFR
  small, random read performance, best of worst case = 2 * (11/8) * 80 IOPS = 220 IOPS

Option 2a (no hot spares):
  space ~= 2 * (12 - 3) * 3TB = 54 TB
  MTTDL[2] = 1.90e+7 years, or 0.05% AFR
  small, random read performance, best of worst case = 2 * (12/9) * 80 IOPS = 213 IOPS

> Other, better ideas?

There are thousands of permutations you could consider :-) For 24-bay systems with double parity or better, we also see 3x 8-disk as a common configuration. Offhand, I'd say we see more 4x 6-disk and 3x 8-disk configs than any configs with more than 10 disks per set.

> My questions are:
> A. How long will resilvering take with these layouts when the disks start dying?

It depends on the concurrent workload. By default, resilvers are throttled and give way to other workload. In general, for double- or triple-parity RAID, you don't need to worry too much on a per-disk basis. The conditions you need to worry about are those where the failure cause is common to all disks, such as a controller, fans, cabling, or power, because they are more likely than a triple failure of disks (as clearly shown by the MTTDL[1] model results above).

> B. Should I prefer hot spares or additional parity drives, and why?

In general, additional parity is better than hot spares. You get more performance and better data dependability.

> The box is a SuperMicro with 36 bays controlled through a single LSI 9211-8i. There is a separate Intel 320 SSD for the OS. The purpose is to back up data from the customer's Windows workstations. I'm leaning toward using BackupPC for the backups since it seems to combine good efficiency with a fairly customer-friendly web interface.

Sounds like a good plan.
 -- richard

> I'm running FreeBSD 9, after having failed to get the plugin jail working in FreeNAS; also, for personal reasons I find csh easier to use than the FreeNAS web interface. My impression is that FreeBSD combines a mature OS with the second oldest/best (after Illumos) free implementation of ZFS.
>
> Thanks in advance,
> Russ Poyner
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
On Jan 4, 2013, at 11:12 AM, Robert Milkowski <rmilkow...@task.gda.pl> wrote:

>> Illumos is not so good at dealing with huge memory systems, but perhaps it is also more stable as well.
>
> Well, I guess that it depends on your environment, but generally I would expect S11 to be more stable, if only because of the sheer number of bugs reported by paying customers, and bug fixes by Oracle, that Illumos is not getting (lack of resources, limited usage, etc.).

This is a two-edged sword. Software reliability analysis shows that the most reliable software is the software that is oldest and unchanged. But people also want new functionality. So while Oracle has more changes being implemented in Solaris, that is destabilizing even as it improves reliability. Unfortunately, it is hard to get both wins. What is more likely is that new features are being driven into Solaris 11 that are destabilizing. By contrast, the number of new features being added to illumos-gate (not to be confused with illumos-based distros) is relatively modest, and in all cases they are not gratuitous.
 -- richard
Re: [zfs-discuss] VDI iops with caching
On Jan 3, 2013, at 8:38 PM, Geoff Nordli <geo...@gnaa.net> wrote:

> Thanks Richard, Happy New Year.
>
> On 13-01-03 09:45 AM, Richard Elling wrote:
>> [Richard's reply of Jan 3, quoted in full in the next message.]
>
> For sure there is going to be a lot of variability, but it seems we aren't even close. Have you seen any back-of-the-napkin calculations which take SSDs for cache usage into consideration?

Yes. I've written a white paper on the subject, somewhere on the nexenta.com website (if it is still available). But more current info is in the presentation at ZFS Day:
  http://www.youtube.com/watch?v=A4yrSfaskwI
  http://www.slideshare.net/relling

> Yes, I would like to stick with HDDs. I am just not quite sure what "quite a few desktops" means. I thought for sure there would be lots of people around that have done small deployments using a standard ZFS deployment.

... and large :-) I did 100 desktops with 2 SSDs two years ago. The presentation was given at OpenStorage Summit 2010. I don't think there is a video, though :-(

Fundamentally, people like to use sizing in IOPS, but all IOPS are not created equal. An I/O satisfied by the ARC is often limited by network bandwidth constraints, whereas an I/O that hits a slow pool is often limited by HDD latency. The two are 5 orders of magnitude apart when using HDDs in the pool.
 -- richard
Re: [zfs-discuss] VDI iops with caching
On Jan 2, 2013, at 8:45 PM, Geoff Nordli <geo...@gnaa.net> wrote:

> I am looking at the performance numbers in the Oracle VDI admin guide:
> http://docs.oracle.com/html/E26214_02/performance-storage.html
>
> From my calculations for 200 desktops running a Windows 7 knowledge user (15 IOPS) with a 30-70 read/write split, it comes to 5100 IOPS. Using 7200 rpm disks, the requirement will be 68 disks. This doesn't seem right, because if you are using clones with caching, you should be able to easily satisfy your reads from ARC and L2ARC. As well, Oracle VDI by default caches writes; therefore the writes will be coalesced and there will be no ZIL activity.

All of these IOPS-based VDI sizing guidelines are wrong. The problem is that the variability of response time is too great for an HDD. The only hope we have of getting the back-of-the-napkin calculations to work is to reduce the variability by using a device that is more consistent in its response (e.g., SSDs).

> Anyone have other guidelines on what they are seeing for IOPS with VDI?

The successful VDI implementations I've seen have relatively small space requirements for the performance-critical work. So there are a bunch of companies offering SSD-based arrays for that market. If you're stuck with HDDs, then effective use of snapshots+clones with a few GB of RAM and a slog can support quite a few desktops.
 -- richard
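The 5100 figure works out if the guide assumes each write costs two backend I/Os (e.g., mirrored vdevs) -- an assumption, but the arithmetic fits:

    200 desktops x 15 IOPS = 3000 IOPS; 30% reads = 900, 70% writes = 2100
    900 reads + 2 x 2100 mirrored writes = 5100 backend IOPS
    5100 / ~75 IOPS per 7200-rpm disk ~= 68 disks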
Re: [zfs-discuss] poor CIFS and NFS performance
On Jan 3, 2013, at 12:33 PM, Eugen Leitl <eu...@leitl.org> wrote:

> On Sun, Dec 30, 2012 at 06:02:40PM +0100, Eugen Leitl wrote:
>> Happy $holidays, I have a pool of 8x ST31000340AS on an LSI 8-port adapter as
>
> Just a little update on the home NAS project. I've set the pool sync to disabled, and added a couple of
>
>   8. c4t1d0 ATA-INTELSSDSA2M080-02G9 cyl 11710 alt 2 hd 224 sec 56
>      /pci@0,0/pci1462,7720@11/disk@1,0
>   9. c4t2d0 ATA-INTELSSDSA2M080-02G9 cyl 11710 alt 2 hd 224 sec 56
>      /pci@0,0/pci1462,7720@11/disk@2,0

Setting sync=disabled means your log SSDs (slogs) will not be used.
 -- richard

> I had no clue what the partition names (created with the napp-it web interface, a la 5% log and 95% cache, of 80 GByte) were, and so did an iostat -xnp:
>
>     r/s  w/s  kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
>     1.4  0.3   5.5   0.0   0.0   0.0     0.0     0.0   0   0  c4t1d0
>     0.1  0.0   3.7   0.0   0.0   0.0     0.0     0.5   0   0  c4t1d0s2
>     0.1  0.0   2.6   0.0   0.0   0.0     0.0     0.5   0   0  c4t1d0s8
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.2   0   0  c4t1d0p0
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t1d0p1
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t1d0p2
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t1d0p3
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t1d0p4
>     1.2  0.3   1.4   0.0   0.0   0.0     0.0     0.0   0   0  c4t2d0
>     0.0  0.0   0.6   0.0   0.0   0.0     0.0     0.4   0   0  c4t2d0s2
>     0.0  0.0   0.7   0.0   0.0   0.0     0.0     0.4   0   0  c4t2d0s8
>     0.1  0.0   0.0   0.0   0.0   0.0     0.0     0.2   0   0  c4t2d0p0
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t2d0p1
>     0.0  0.0   0.0   0.0   0.0   0.0     0.0     0.0   0   0  c4t2d0p2
>
> then issued
>
> # zpool add tank0 cache /dev/dsk/c4t1d0p1 /dev/dsk/c4t2d0p1
> # zpool add tank0 log mirror /dev/dsk/c4t1d0p0 /dev/dsk/c4t2d0p0
>
> which resulted in
>
> root@oizfs:~# zpool status
>   pool: rpool
>  state: ONLINE
>   scan: scrub repaired 0 in 0h1m with 0 errors on Wed Jan  2 21:09:23 2013
> config:
>         NAME        STATE     READ WRITE CKSUM
>         rpool       ONLINE       0     0     0
>           c4t3d0s0  ONLINE       0     0     0
> errors: No known data errors
>
>   pool: tank0
>  state: ONLINE
>   scan: scrub repaired 0 in 5h17m with 0 errors on Wed Jan  2 17:53:20 2013
> config:
>         NAME                       STATE     READ WRITE CKSUM
>         tank0                      ONLINE       0     0     0
>           raidz3-0                 ONLINE       0     0     0
>             c3t5000C500098BE9DDd0  ONLINE       0     0     0
>             c3t5000C50009C72C48d0  ONLINE       0     0     0
>             c3t5000C50009C73968d0  ONLINE       0     0     0
>             c3t5000C5000FD2E794d0  ONLINE       0     0     0
>             c3t5000C5000FD37075d0  ONLINE       0     0     0
>             c3t5000C5000FD39D53d0  ONLINE       0     0     0
>             c3t5000C5000FD3BC10d0  ONLINE       0     0     0
>             c3t5000C5000FD3E8A7d0  ONLINE       0     0     0
>         logs
>           mirror-1                 ONLINE       0     0     0
>             c4t1d0p0               ONLINE       0     0     0
>             c4t2d0p0               ONLINE       0     0     0
>         cache
>           c4t1d0p1                 ONLINE       0     0     0
>           c4t2d0p1                 ONLINE       0     0     0
> errors: No known data errors
>
> and in bonnie++, before:
>
> NAME   SIZE   Bonnie  Date(y.m.d)  File    Seq-Wr-Chr %CPU  Seq-Write %CPU  Seq-Rewr %CPU  Seq-Rd-Chr %CPU  Seq-Read %CPU  Rnd Seeks %CPU  Files  Seq-Create  Rnd-Create
> rpool  59.5G  start   2012.12.28   15576M  24 MB/s    61    47 MB/s   18    40 MB/s  19    26 MB/s    98    273 MB/s 48    2657.2/s  25    16     12984/s     12058/s
> tank0  7.25T  start   2012.12.29   15576M  35 MB/s    86    145 MB/s  48    109 MB/s 50    25 MB/s    97    291 MB/s 53    819.9/s   12    16     12634/s     9194/s
>
> and after:
>
> rpool  59.5G  start   2012.12.28   15576M  24 MB/s    61    47 MB/s   18    40 MB/s  19    26 MB/s    98    273 MB/s 48    2657.2/s  25    16     12984/s     12058/s
> tank0  7.25T  start   2013.01.03   15576M  35 MB/s    86    149 MB/s  48    111 MB/s 50    26 MB/s    98    404 MB/s 76    1094.3/s  12    16     12601/s     9937/s
>
> Does the layout make sense? Do the stats make sense, or is there still something very wrong with that pool? Thanks.
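If the mirrored slog is meant to do anything, sync has to go back to its default; otherwise the log devices sit idle:

    # zfs set sync=standard tank0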
Re: [zfs-discuss] poor CIFS and NFS performance
On Jan 2, 2013, at 2:03 AM, Eugen Leitl <eu...@leitl.org> wrote:

> On Sun, Dec 30, 2012 at 10:40:39AM -0800, Richard Elling wrote:
>> On Dec 30, 2012, at 9:02 AM, Eugen Leitl <eu...@leitl.org> wrote:
>>> The system is a MSI E350DM-E33 with 8 GByte PC1333 DDR3 memory, no ECC. All the systems have Intel NICs with mtu 9000 enabled, including all switches in the path.
>> Does it work faster with the default MTU?
>
> No, it was even slower; that's why I went from 1500 to 9000. I estimate it brought ~20 MByte/s more peak on Windows 7 64-bit CIFS.

OK, then you have something else very wrong in your network.

>> Also check for retrans and errors, using the usual network performance debugging checks.
>
> Wireshark or tcpdump on Linux/Windows? What would you suggest for OI?

Look at all of the stats for all NICs and switches on both ends of each wire. Look for collisions (should be 0), drops (should be 0), dups (should be 0), retrans (should be near 0), flow control (the server shouldn't see flow-control activity), etc. There is considerable written material on how to diagnose network flakiness.

>>> P.S. Not sure whether this is pathological, but the system does produce occasional soft errors, e.g. in dmesg.
>> More likely these are due to SMART commands not being properly handled for SATA devices. They are harmless.
>
> Otherwise napp-it attests full SMART support.

Yep, this is a SATA/SAS/SMART interaction where assumptions are made that might not be true. Usually it means that the SMART probes are using SCSI commands on SATA disks.
 -- richard
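On OpenIndiana, rough equivalents of the usual Linux-side checks (e1000g0 is a placeholder interface name):

    # dladm show-link -s e1000g0       # per-link packet and error counters
    # netstat -i                       # input/output errors, collisions
    # kstat -m e1000g | grep -i error  # driver-level error kstats
    # snoop -d e1000g0 port 445        # packet capture, OI's tcpdump analogue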
Re: [zfs-discuss] poor CIFS and NFS performance
On Dec 30, 2012, at 9:02 AM, Eugen Leitl eu...@leitl.org wrote: Happy $holidays, I have a pool of 8x ST31000340AS on an LSI 8-port adapter as a raidz3 (no compression nor dedup) with reasonable bonnie++ 1.03 values, e.g. 145 MByte/s Seq-Write @ 48% CPU and 291 MByte/s Seq-Read @ 53% CPU. It scrubs with 230+ MByte/s with reasonable system load. No hybrid pools yet. This is the latest beta napp-it on an OpenIndiana 151a5 server, living on a dedicated 64 GByte SSD. The system is an MSI E350DM-E33 with 8 GByte PC1333 DDR3 memory, no ECC. All the systems have Intel NICs with mtu 9000 enabled, including all switches in the path. Does it work faster with the default MTU? Also check for retrans and errors, using the usual network performance debugging checks. My problem is pretty poor network throughput. An NFS mount on 12.04 64 bit Ubuntu (mtu 9000) or CIFS is read at about 23 MBytes/s. Windows 7 64 bit (also jumbo frames) reads at about 65 MBytes/s. The highest transfer speed on Windows just touches 90 MByte/s, before falling back to the usual 60-70 MBytes/s. I kinda can live with the above values, but I have a feeling the setup should be able to saturate GBit Ethernet with large file transfers, especially on Linux (20 MByte/s is nothing to write home about). Does anyone have any suggestions on how to debug/optimize throughput? Thanks, and happy 2013. P.S. Not sure whether this is pathological, but the system does produce occasional soft errors in dmesg, e.g. below. More likely these are due to SMART commands not being properly handled for SATA devices. They are harmless. -- richard
Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Dec 30 17:45:00 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor unique code 0x0), ASCQ: 0x1d, FRU: 0x0
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009c72c48 (sd9):
Dec 30 17:45:01 oizfs Error for Command: undecoded cmd 0xa1 Error Level: Recovered
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor unique code 0x0), ASCQ: 0x1d, FRU: 0x0
Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 0
Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 1
Dec 30 17:45:01 oizfs pcplusmp: [ID 805372 kern.info] pcplusmp: ide (ata) instance 0 irq 0xe vector 0x45 ioapic 0x3 intin 0xe is bound to cpu 0
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c50009c73968 (sd4):
Dec 30 17:45:01 oizfs Error for Command: undecoded cmd 0xa1 Error Level: Recovered
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Dec 30 17:45:01 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor unique code 0x0), ASCQ: 0x1d, FRU: 0x0
Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c500098be9dd (sd10):
Dec 30 17:45:03 oizfs Error for Command: undecoded cmd 0xa1 Error Level: Recovered
Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Dec 30 17:45:03 oizfs scsi: [ID 107833 kern.notice] ASC: 0x0 (vendor unique code 0x0), ASCQ: 0x1d, FRU: 0x0
Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1462,7720@11/disk@3,0 (sd8):
Dec 30 17:45:04 oizfs Error for Command: undecoded cmd 0xa1 Error Level: Recovered
Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Requested Block: 0 Error Block: 0
Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice] Sense Key: Soft_Error
Dec 30 17:45:04 oizfs scsi: [ID 107833 kern.notice]
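For reference, the "usual network performance debugging checks" Richard alludes to look something like the following on an illumos/OpenIndiana box. This is only a sketch: the interface name e1000g0 is a placeholder, and exact counter names vary by release.
# confirm the MTU actually in effect on the link
dladm show-linkprop -p mtu e1000g0
# look for TCP retransmissions on the server side
netstat -s -P tcp | grep -i retrans
# and for link-level errors
dladm show-link -s e1000g0
# to test the default-MTU question, drop jumbo frames temporarily
dladm set-linkprop -p mtu=1500 e1000g0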
Re: [zfs-discuss] ZFS QoS and priorities
On Dec 6, 2012, at 5:30 AM, Matt Van Mater matt.vanma...@gmail.com wrote: I'm unclear on the best way to warm data... do you mean to simply `dd if=/volumes/myvol/data of=/dev/null`? I have always been under the impression that ARC/L2ARC has rate limiting on how much data can be added to the cache per interval (I can't remember the interval). Is this not the case? If there is some rate limiting in place, dd-ing the data like my example above would not necessarily cache all of the data... it might take several iterations to populate the cache, correct? Quick update... I found at least one reference to the rate limiting I was referring to. It was Richard from ~2.5 years ago :) http://marc.info/?l=zfs-discuss&m=127060523611023&w=2 I assume the source code reference is still valid, in which case a population rate of 8 MB per second into L2ARC is extremely slow in my book and very conservative... It would take a very long time to warm the hundreds of gigs of VMs we have into cache. Perhaps the L2ARC_WRITE_BOOST tunable might be a good place to aggressively warm a cache, but my preference is to not touch the tunables if I have a choice. I'd rather the system default be updated to reflect modern hardware, that way everyone benefits and I'm not running some custom build. Yep, the default L2ARC fill rate is quite low for modern systems. It is not uncommon to see it increased significantly, with the corresponding improvements in hit rate for busy systems. Can you file an RFE at https://www.illumos.org/projects/illumos-gate/issues/ Thanks! -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
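Until such an RFE lands, the defaults can be raised persistently via /etc/system; a sketch, with values picked only as an example of a more aggressive fill rate, not as a recommendation:
* /etc/system fragment: raise the L2ARC fill rate (example values)
set zfs:l2arc_write_max = 67108864
set zfs:l2arc_write_boost = 134217728
A reboot is required for /etc/system changes to take effect.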
Re: [zfs-discuss] ZFS QoS and priorities
On Dec 5, 2012, at 5:41 AM, Jim Klimov jimkli...@cos.ru wrote: On 2012-12-05 04:11, Richard Elling wrote: On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru wrote: I've heard a claim that ZFS relies too much on RAM caching, but implements no sort of priorities (indeed, I've seen no knobs to tune those) - so that if the storage box receives many different types of IO requests with different administrative weights in the view of admins, it cannot really throttle some IOs to boost others, when such IOs have to hit the pool's spindles. Caching has nothing to do with QoS in this context. *All* modern filesystems cache to RAM, otherwise they are unusable. Yes, I get that. However, many systems get away with less RAM than recommended for ZFS rigs (like the ZFS SA with a couple hundred GB as the starting option), and make their compromises elsewhere. They have to anyway, and they get different results, perhaps even better suited to certain narrow or big niches. This is nothing more than a specious argument. They have small caches, so their performance is not as good as those with larger caches. This is like saying you need a smaller CPU cache because larger CPU caches get full. Whatever the aggregate result, this difference does lead to some differing features that The Others' marketing trumpets praise as the advantage :) - like this ability to mark some IO traffic as of higher priority than other traffic, in one case (which is now also an Oracle product line, apparently)... Actually, this question stems from a discussion at a seminar I've recently attended - which praised ZFS but pointed out its weaknesses against some other players on the market, so we are not unaware of those. For example, I might want to have corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower priority zones and VMs, and at the bottom of the list - backups. Please read the papers on the ARC and how it deals with MFU and MRU cache types. You can adjust these policies using the primarycache and secondarycache properties at the dataset level. I've read up on that, and don't exactly see how much these help if there is pressure on RAM so that cache entries expire... Meaning, if I want certain datasets to remain cached as long as possible (i.e. serve website or DB from RAM, not HDD), at the expense of other datasets that might see higher usage, but have lower business priority - how do I do that? Or, perhaps, add (L2)ARC shares, reservations and/or quotas concepts to certain datasets which I explicitly want to throttle up or down? MRU evictions take precedence over MFU evictions. If the data is not in MFU, then it is, by definition, not being frequently used. At most, now I can mark the lower-priority datasets' data or even metadata as not cached in ARC or L2ARC. On-off. There seem to be no smaller steps, like in QoS tags [0-7] or something like that. BTW, as a short side question: is it a true or false statement, that: if I set primarycache=metadata, then ZFS ARC won't cache any userdata and thus it won't appear in (expire into) L2ARC? So the real setting is that I can cache data+meta in RAM, and only meta in SSD? Not the other way around (meta in RAM but both data+meta in SSD)? That is correct, by my reading of the code. AFAIK, now such requests would hit the ARC, then the disks if needed - in no particular order. Well, can the order be made particular with current ZFS architecture, i.e.
by setting some datasets to have a certain NICEness or another priority mechanism? ZFS has a priority-based I/O scheduler that works at the DMU level. However, there is no system call interface in UNIX that transfers priority or QoS information (e.g. read() or write()) into the file system VFS interface. So the granularity of priority control is by zone or dataset. I do not think I've seen mention of priority controls per dataset, at least not in generic ZFS. Actually, that was part of my question above. And while throttling or resource shares between higher level software components (zones, VMs) might have similar effect, this is not something really controlled and enforced by the storage layer. The priority scheduler is by type of I/O request. For example, sync requests have priority over async requests. Reads and writes have priority over scrubbing etc. The inter-dataset scheduling is done at the zone level. There is more work being done in this area, but it is still in the research phase. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
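For anyone wanting to try the knobs Richard mentions, the per-dataset syntax is simply the following (dataset names are hypothetical):
# low-priority data: cache only metadata in ARC, keep it out of L2ARC
zfs set primarycache=metadata tank/backups
zfs set secondarycache=none tank/backups
# high-priority data: use both caches fully (these are the defaults)
zfs set primarycache=all tank/db
zfs set secondarycache=all tank/db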
Re: [zfs-discuss] ZFS QoS and priorities
On Dec 5, 2012, at 7:46 AM, Matt Van Mater matt.vanma...@gmail.com wrote: I don't have anything significant to add to this conversation, but wanted to chime in that I also find the concept of a QOS-like capability very appealing and that Jim's recent emails resonate with me. You're not alone! I believe there are many use cases where a granular prioritization that controls how ARC, L2ARC, ZIL and underlying vdevs are used to give priority IO to a specific zvol, share, etc would be useful. My experience is stronger in the networking side and I envision a weighted class-based queuing methodology (or something along those lines). I recognize that ZFS's architectural preference for coalescing writes and reads into larger sequential batches might conflict with a QOS-like capability... Perhaps the ARC/L2ARC tuning might be a good starting point towards that end? At present, I do not see async write QoS as being interesting. That leaves sync writes and reads as the managed I/O. Unfortunately, with HDDs, the variance in response time swamps the queue management time, so the results are less useful than the case with SSDs. Control theory works, once again. For sync writes, they are often latency-sensitive and thus have the highest priority. Reads have lower priority, with prefetch reads at lower priority still. On a related note (maybe?) I would love to see pool-wide settings that control how aggressively data is added/removed from ARC, L2ARC, etc. Evictions are done on an as-needed basis. Why would you want to evict more than needed? So you could fetch it again? Prefetching can be more aggressive, but we actually see busy systems disabling prefetch to improve interactive performance. Queuing theory works, once again. Something that would accelerate the warming of a cold pool of storage or be more aggressive in adding/removing cached data on a volatile dataset (e.g. where Virtual Machines are turned on/off frequently). I have heard that some of these defaults might be changed in some future release of Illumos, but haven't seen any specifics saying that the idea is nearing fruition in release XYZ. It is easy to warm data (dd), even to put it into MRU (dd + dd). For best performance with VMs, MRU works extremely well, especially with clones. There are plenty of good ideas being kicked around here, but remember that to support things like QoS at the application level, the applications must be written to an interface that passes QoS hints all the way down the stack. Lacking these interfaces means that QoS needs to be managed by hand... and that management effort must be worth the effort. -- richard Matt On Wed, Dec 5, 2012 at 10:26 AM, Jim Klimov jimkli...@cos.ru wrote: On 2012-11-29 10:56, Jim Klimov wrote: For example, I might want to have corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower priority zones and VMs, and at the bottom of the list - backups. On a side note, I'm now revisiting old ZFS presentations collected over the years, and one suggested as TBD statements the ideas that metaslabs with varying speeds could be used for specific tasks, and not only to receive the allocations first so that a new pool would perform quickly. I.e. TBD: Workload specific freespace selection policies. Say, I create a new storage box and lay out some bulk file, backup and database datasets.
Even as they are receiving their first bytes, I have some idea about the kind of performance I'd expect from them - with QoS per dataset I might destine the databases to the fast LBAs (and smaller seeks between tracks I expect to use frequently), and the bulk data onto slower tracks right from the start, and the rest of unspecified data would grow around the middle of the allocation range. These types of data would then only creep onto the less fitting metaslabs (faster for bulk, slower for DB) if the target ones run out of free space. Then the next-best-fitting would be used... This one idea is somewhat reminiscent of hierarchical storage management, except that it is about static allocation at the write-time and takes place within the single disk (or set of similar disks), in order to warrant different performance for different tasks. ///Jim -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
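As a concrete version of the dd warming trick mentioned above (the path is hypothetical; per the correction that follows in this digest, the second pass promotes the blocks from MRU to MFU, where they are harder to evict):
# first read pulls the blocks into the ARC MRU list
dd if=/tank/vm/disk0.img of=/dev/null bs=1024k
# second read promotes them to MFU
dd if=/tank/vm/disk0.img of=/dev/null bs=1024k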
Re: [zfs-discuss] ZFS QoS and priorities
bug fix below... On Dec 5, 2012, at 1:10 PM, Richard Elling richard.ell...@gmail.com wrote: It is easy to warm data (dd), even to put it into MRU (dd + dd). For best performance with VMs, MRU works extremely well, especially with clones. Should read: It is easy to warm data (dd), even to put it into MFU (dd + dd). For best performance with VMs, MFU works extremely well, especially with clones. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS QoS and priorities
On Nov 29, 2012, at 1:56 AM, Jim Klimov jimkli...@cos.ru wrote: I've heard a claim that ZFS relies too much on RAM caching, but implements no sort of priorities (indeed, I've seen no knobs to tune those) - so that if the storage box receives many different types of IO requests with different administrative weights in the view of admins, it cannot really throttle some IOs to boost others, when such IOs have to hit the pool's spindles. Caching has nothing to do with QoS in this context. *All* modern filesystems cache to RAM, otherwise they are unusable. For example, I might want to have corporate webshop-related databases and appservers to be the fastest storage citizens, then some corporate CRM and email, then various lower priority zones and VMs, and at the bottom of the list - backups. Please read the papers on the ARC and how it deals with MFU and MRU cache types. You can adjust these policies using the primarycache and secondarycache properties at the dataset level. AFAIK, now such requests would hit the ARC, then the disks if needed - in no particular order. Well, can the order be made particular with current ZFS architecture, i.e. by setting some datasets to have a certain NICEness or another priority mechanism? ZFS has a priority-based I/O scheduler that works at the DMU level. However, there is no system call interface in UNIX that transfers priority or QoS information (e.g. read() or write()) into the file system VFS interface. So the granularity of priority control is by zone or dataset. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS QoS and priorities
On Dec 1, 2012, at 6:54 PM, Nikola M. minik...@gmail.com wrote: On 12/ 2/12 03:24 AM, Nikola M. wrote: It is using Solaris Zones and throttling their disk usage on that level, so you separate workload processes into separate zones. Or even put KVM machines under the zones (Joyent and OI support the Joyent-written KVM/Intel implementation in Illumos) for the same reason of I/O throttling. They (Joyent) say that their solution is made in not too much code, but gives very good results (they run a massive cloud computing service, with many zones and KVM VMs, so they might know). http://wiki.smartos.org/display/DOC/Tuning+the+IO+Throttle http://dtrace.org/blogs/wdp/2011/03/our-zfs-io-throttle/ There is a short video from the 16th minute onward, from the BayLISA meetup at Joyent, August 16, 2012 https://www.youtube.com/watch?v=6csFi0D5eGY talking about the ZFS throttle implementation architecture in Illumos, from Joyent's SmartOS. There was a good presentation on this at the OpenStorage Summit in 2011. Look for it on YouTube. I learned it is also available in Entic.net-sponsored OpenIndiana and probably in Nexenta, too, since it is implemented inside Illumos. NexentaStor 3.x is not an illumos-based distribution, it is based on OpenSolaris b134. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
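On SmartOS the throttle is exposed as a per-zone property; a sketch of adjusting it (the UUID and value are placeholders, see the SmartOS docs linked above):
# give one zone a larger share of pool I/O; higher number = more weight
vmadm update 01b2c3d4-aaaa-bbbb-cccc-123456789abc zfs_io_priority=100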
Re: [zfs-discuss] dm-crypt + ZFS on Linux
On Nov 23, 2012, at 11:56 AM, Fabian Keil freebsd-lis...@fabiankeil.de wrote: Just in case your GNU/Linux experiments don't work out, you could also try ZFS on Geli on FreeBSD which works reasonably well. For illumos-based distros or Solaris 11, using ZFS with lofi has been well discussed for many years. Prior to the crypto option being integrated as a first class citizen in OpenSolaris, the codename used was xlofi, so try that in your google searches, or look at the man page for lofiadm -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
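A rough sketch of the lofi approach, untested and from memory, so check lofiadm(1M) for the exact options on your release (the file path and algorithm are placeholders):
mkfile 10g /export/secure.img
lofiadm -c aes-256-cbc -a /export/secure.img   # prompts for a passphrase
zpool create securepool /dev/lofi/1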
Re: [zfs-discuss] Hardware Recommendations: SAS2 JBODs
On Nov 13, 2012, at 12:08 PM, Peter Tripp pe...@psych.columbia.edu wrote: Hi folks, I'm in the market for a couple of JBODs. Up until now I've been relatively lucky with finding hardware that plays very nicely with ZFS. All my gear currently in production uses LSI SAS controllers (3801e, 9200-16e, 9211-8i) with backplanes powered by LSI SAS expanders (Sun x4250, Sun J4400, etc). But I'm in the market for SAS2 JBODs to support a large number of 3.5-inch SAS disks (60+ 3TB disks to start). I'm aware of potential issues with SATA drives/interposers and the whole SATA Tunneling Protocol (STP) nonsense, so I'm going to stick to a pure SAS setup. Also, since I've had trouble in the past with daisy-chained SAS JBODs I'll probably stick with one SAS 4x cable (SFF8088) per JBOD and unless there were a compelling reason for multi-pathing I'd probably stick to a single controller. If possible I'd rather buy 20 packs of enterprise SAS disks with 5yr warranties and have the JBOD come with empty trays, but would also consider buying disks with the JBOD if the price wasn't too crazy. Does anyone have any positive/negative experiences with any of the following with ZFS: * SuperMicro SC826E16-R500LPB (2U 12 drives, dual 500w PS, single LSI SAS2X28 expander) * SuperMicro SC846BE16-R920B (4U 24 drives, dual 920w PS, single unknown expander) * Dell PowerVault MD 1200 (2U 12 drives, dual 600w PS, dual unknown expanders) * HP StorageWorks D2600 (2U 12 drives, dual 460w PS, single/dual unknown expanders) I've used all of the above and all of the DataOn systems, too (Hi Rocky!) No real complaints, though as others have noted the supermicro gear tends to require more work to get going. -- richard I'm leaning towards the SuperMicro stuff, but every time I order SuperMicro gear there's always something missing or wrongly configured so some of the cost savings gets eaten up with my time figuring out where things went wrong and returning/ordering replacements. The Dell/HP gear I'm sure is fine, but buying disks from them gets pricey quick. The last time I looked they charged $150 extra per disk when the only added value was a proprietary sled and a shorter warranty (3yr vs 5yr). I'm open to other JBOD vendors too, was really just curious what folks were using when they needed more than two dozen 3.5-inch SAS disks for use with ZFS. Thanks -Peter ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARC de-allocation with large ram
On Oct 22, 2012, at 6:52 AM, Chris Nagele nag...@wildbit.com wrote: If after it decreases in size it stays there it might be similar to: 7111576 arc shrinks in the absence of memory pressure After it dropped, it did build back up. Today is the first day that these servers are working under real production load and it is looking much better. arcstat is showing some nice numbers for arc, but l2 is still building.
read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size
 19K   17K  2.5K    87    2.5K     490    2.0K      19   148G    371G
 41K   39K  2.3K    94    2.3K     184    2.1K       7   148G    371G
 34K   34K   694    98     694      17     677       2   148G    371G
 16K   15K  1.0K    93    1.0K      16    1.0K       1   148G    371G
 39K   36K  2.3K    94    2.3K      20    2.3K       0   148G    371G
 23K   22K   746    96     746      76     670      10   148G    371G
 49K   47K  1.7K    96    1.7K     249    1.5K      14   148G    371G
 23K   21K  1.4K    93    1.4K      38    1.4K       2   148G    371G
My only guess is that the large zfs send / recv streams were affecting the cache when they started and finished. There are other cases where data is evicted from the ARC, though I don't have a complete list at my fingertips. For example, if a zvol is closed, then the data for the zvol is evicted. -- richard Thanks for the responses and help. Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
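The numbers arcstat prints come straight from kstats, so spot checks without the tool are easy; for example:
# current ARC size and target, plus L2ARC size and hit/miss counters
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses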
Re: [zfs-discuss] zfs send to older version
On Oct 19, 2012, at 4:59 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Richard Elling At some point, people will bitterly regret some zpool upgrade with no way back. uhm... and how is that different than anything else in the software world? No attempt at backward compatibility, and no downgrade path, not even by going back to an older snapshot before the upgrade. ZFS has a stellar record of backwards compatibility. The only break with backwards compatibility I can recall was a bug fix in the send stream somewhere around opensolaris b34. Perhaps you are confusing backwards compatibility with forwards compatibility? -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send to older version
On Oct 19, 2012, at 1:04 AM, Michel Jansens michel.jans...@ulb.ac.be wrote: On 10/18/12 21:09, Michel Jansens wrote: Hi, I've been using a Solaris 10 update 9 machine for some time to replicate filesystems from different servers through zfs send|ssh zfs receive. This was done to store disaster recovery pools. The DR zpools are made from sparse files (to allow for easy/efficient backup to tape). Now I've installed a Solaris 11 machine and a SmartOS one. When I try to replicate the pools from those machines, I get an error because the filesystem/pool version doesn't support some features/properties on the Solaris 10u9 side. Is there a way (apart from rsync) to send a snapshot from a newer zpool to an older one? You have to create pools/filesystems with the older versions used by the destination machine. Thanks Ian, One thing that is annoying though with running an old pool version on Solaris is that zpool status -x doesn't return 'all pools are healthy'. And I wonder how SmartOS or Solaris 11 will react with a Solaris 10 update 9 version filesystem for zones or KVM... Also, hearing about the new feature flags, I have a feeling that there is a risk of the ZFS world becoming more and more fragmented. Feature flags offer a sane method to deal with the existing fragmentation. Everyone will have it, except Oracle Solaris. At some point, people will bitterly regret some zpool upgrade with no way back. uhm... and how is that different than anything else in the software world? In that fragmented world, some common exchange (replication) format would be reassuring. In this respect, I suppose Arne Jansen's zfs fits-send portable streams are good news, though it's write-only (to BTRFS), and it looks like a filesystem-only feature (not for volumes). FITS is interesting for those file systems that support snapshots. If the market demands, there could be some interesting work done for interop with ReFS and others. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
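Ian's suggestion in command form, roughly (pool/dataset names are hypothetical, and the version numbers are chosen to match a pre-feature-flags Solaris 10u9 receiver, so verify against 'zpool upgrade -v' and 'zfs upgrade -v' on the destination):
# on the newer sender, pin the replicated datasets to old versions at creation
zpool create -o version=28 -O version=5 drpool c0t1d0
zfs create -o version=5 drpool/replica
zfs snapshot -r drpool/replica@today
zfs send -R drpool/replica@today | ssh s10host zfs receive -d tank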
Re: [zfs-discuss] looking for slides for basic zfs intro
On Oct 19, 2012, at 6:37 AM, Eugen Leitl eu...@leitl.org wrote: Hi, I would like to give a short talk at my organisation in order to sell them on zfs in general, and on zfs-all-in-one and zfs as remote backup (zfs send). Googling will find a few shorter presos. I have full-day presos on slideshare http://www.slideshare.net/relling source available on request. -- richard Does anyone have a short set of presentation slides or maybe a short video I could pillage for that purpose? Thanks. -- Eugen ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Changing rpool device paths/drivers
On Oct 19, 2012, at 12:16 AM, James C. McPherson j...@opensolaris.org wrote: On 19/10/12 04:50 PM, Jim Klimov wrote: Hello all, I have one more thought - or a question - about the current strangeness of rpool import: is it supported, or does it work, to have rpools on multipathed devices? If yes (which I hope it is, but don't have a means to check) what sort of a string is saved into the pool's labels as its device path? Some metadevice which is on a layer above mpxio, or one of the physical storage device paths? If the latter is the case, what happens during system boot if the multipathing happens to choose another path, not the one saved in labels? if you run /usr/bin/strings over /etc/zfs/zpool.cache, you'll see that not only is the device path stored, but (more importantly) the devid. yuk. zdb -C is what you want. As far as I'm aware, having an rpool on multipathed devices is fine. Multiple paths to the device should still allow ZFS to obtain the same devid info... and we use devids in preference to physical paths. It is fine. The boot process is slightly different in that zpool.cache is not consulted at first. However, it is consulted later, so there are edge cases where this can cause problems when there are significant changes in the device tree. The archives are full of workarounds for this rare case. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
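To see exactly what was recorded for each vdev, Richard's zdb suggestion boils down to:
# dump the cached pool config; look at the path and devid lines per child vdev
zdb -C rpool | egrep 'path|devid'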
Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question
On Oct 12, 2012, at 5:50 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Richard Elling [mailto:richard.ell...@gmail.com] Pedantically, a pool can be made in a file, so it works the same... A pool can only be made in a file by a system that is able to create a pool. You can't send a pool, you can only send a dataset. Whether you receive the dataset into a pool or file is a minor nit, the send stream itself is consistent. Point is, his receiving system runs linux and doesn't have any zfs; his receiving system is remote from his sending system, and it has been suggested that he might consider making an iscsi target available, so the sending system could zpool create and zfs receive directly into a file or device on the receiving system, but it doesn't seem as if that's going to be possible for him - he's expecting to transport the data over ssh. So he's looking for a way to do a zfs receive on a linux system, transported over ssh. Suggested answers so far include building a VM on the receiving side, to run openindiana (or whatever) or using zfs-fuse-linux. He is currently writing his zfs send datastream into a series of files on the receiving system, but this has a few disadvantages as compared to doing zfs receive on the receiving side. Namely, increased risk of data loss and less granularity for restores. For these reasons, it's been suggested to find a way of receiving via zfs receive and he's exploring the possibilities of how to improve upon this situation. Namely, how to zfs receive on a remote linux system via ssh, instead of cat'ing or redirecting into a series of files. There, I think I've recapped the whole thread now. ;-) Yep, and cat works fine. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] horrible slow pool
Hi John, comment below... On Oct 11, 2012, at 3:10 AM, Carsten John cj...@mpi-bremen.de wrote: Hello everybody, I just wanted to share my experience with a (partially) broken SSD that was in use in a ZIL mirror. We experienced a dramatic performance problem with one of our zpools, serving home directories. Mainly NFS clients were affected. Our SunRay infrastructure came to a complete halt. Finally we were able to identify one SSD as the root cause. The SSD was still working, but quite slow. The issue didn't trigger ZFS to detect the disk as faulty. FMA didn't detect it, either. We identified the broken disk by issuing 'iostat -en'. After replacing the SSD, everything went back to normal. To prevent outages like this in the future I hacked together a quick and dirty bash script to detect disks with a given rate of total errors. The script might be used in conjunction with nagios. This shouldn't be needed. All of the fields of iostat are in kstats and nagios can already collect kstats. kstat -pm sderr The good thing about using this method is that it works with or without ZFS. The bad thing is that some SMART tools and devices trigger complaints that show up as errors (that can be safely ignored) -- richard Perhaps it's of use for others as well:
###
#!/bin/bash
# Check disks in all pools for errors.
# Partially failing (or slow) disks may result in horribly degraded
# performance of zpools despite the fact the pool is still healthy.
# exit codes:
# 0 OK
# 1 WARNING
# 2 CRITICAL
# 3 UNKNOWN

OUTPUT=""
WARNING=0
CRITICAL=0
SOFTLIMIT=5
HARDLIMIT=20

LIST=$(zpool status | grep "c[1-9].*d0" | awk '{print $1}')

for DISK in $LIST
do
    # column 4 of 'iostat -enr' is the total error count for the device
    ERROR=$(iostat -enr $DISK | cut -d "," -f 4 | grep "^[0-9]")
    if [[ $ERROR -gt $SOFTLIMIT ]]
    then
        OUTPUT="$OUTPUT, $DISK:$ERROR"
        WARNING=1
    fi
    if [[ $ERROR -gt $HARDLIMIT ]]
    then
        CRITICAL=1
    fi
done

if [[ $CRITICAL -gt 0 ]]
then
    echo "CRITICAL: Disks with error count >= $HARDLIMIT found: $OUTPUT"
    exit 2
fi
if [[ $WARNING -gt 0 ]]
then
    echo "WARNING: Disks with error count >= $SOFTLIMIT found: $OUTPUT"
    exit 1
fi
echo "OK: No significant disk errors found"
exit 0
###
cu Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
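Richard's kstat suggestion removes the iostat parsing entirely; a sketch of the equivalent check, with threshold handling left to the nagios side:
# per-device error counters, straight from the kernel error kstats
kstat -pm sderr | egrep 'Hard Errors|Soft Errors|Transport Errors'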
Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question
On Oct 11, 2012, at 6:03 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: Richard Elling [mailto:richard.ell...@gmail.com] Read it again; he asked, On that note, is there a minimal user-mode zfs thing that would allow receiving a stream into an image file? Something like: zfs send ... | ssh user@host cat > file He didn't say he wanted to cat to a file. But it doesn't matter. It was only clear from context, responding to the advice of zfs receive-ing into a zpool-in-a-file, that it was clear he was asking about doing a zfs receive into a file, not just cat. If you weren't paying close attention to the thread, it would be easy to misunderstand what he was asking for. Pedantically, a pool can be made in a file, so it works the same... When he asked for minimal user-mode he meant something less than a full-blown OS installation just for the purpose of zfs receive. He went on to say he was considering zfs-fuse-on-linux. ... though I'm not convinced zfs-fuse supports files, whereas illumos/Solaris does. Perhaps a linux fuse person can respond. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On Oct 11, 2012, at 2:58 PM, Phillip Wagstrom phillip.wagst...@gmail.com wrote: On Oct 11, 2012, at 4:47 PM, andy thomas wrote: According to a Sun document called something like 'ZFS best practice' I read some time ago, best practice was to use the entire disk for ZFS and not to partition or slice it in any way. Does this advice hold good for FreeBSD as well? My understanding of the best practice was that with Solaris prior to ZFS, it disabled the volatile disk cache. This is not quite correct. If you use the whole disk, ZFS will attempt to enable the write cache. To understand why, remember that UFS (and ext, by default) can die a horrible death (+fsck) if there is a power outage and cached data is not flushed to disk. So by default, Sun shipped some disks with write cache disabled by default. For non-Sun disks, they are most often shipped with write cache enabled and the most popular file systems (NTFS) properly issue cache flush requests as needed (for the same reason ZFS issues cache flush requests). With ZFS, the disk cache is used, but after every transaction a cache-flush command is issued to ensure that the data made it to the platters. The write cache is flushed after uberblock updates and for ZIL writes. This is important for uberblock updates, so the uberblock doesn't point to a garbaged MOS. It is important for ZIL writes, because they must be guaranteed written to media before ack. -- richard If you slice the disk, enabling the disk cache for the whole disk is dangerous because other file systems (meaning UFS) wouldn't do the cache-flush and there was a risk of data loss should the cache fail due to, say, a power outage. Can't speak to how BSD deals with the disk cache. I looked at a server earlier this week that was running FreeBSD 8.0 and had 2 x 1 Tb SAS disks in a ZFS v13 mirror with a third identical disk as a spare. Large file I/O throughput was OK but the mail jail it hosted had periods when it was very slow accessing lots of small files. All three disks (the two in the ZFS mirror plus the spare) had been partitioned with gpart so that partition 1 was a 6 GB swap and partition 2 filled the rest of the disk and had a 'freebsd-zfs' partition on it. It was these second partitions that were part of the mirror. This doesn't sound like a very good idea to me as surely disk seeks for swap and for ZFS file I/O are bound to clash, aren't they? It surely would make a slow, memory-starved swapping system even slower. :) Another point about the Sun ZFS paper - it mentioned optimum performance would be obtained with RAIDz pools if the number of disks was between 3 and 9. So I've always limited my pools to a maximum of 9 active disks plus spares but the other day someone here was talking of seeing hundreds of disks in a single pool! So what is the current advice for ZFS in Solaris and FreeBSD? That number was drives per vdev, not per pool. -Phil ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
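This is the reason behind the whole-disk best practice: when ZFS is given the whole disk it can safely enable the write cache itself, since it knows every consumer of the disk issues cache flushes. Device names below are hypothetical:
# whole-disk vdevs: ZFS turns on the disk write cache
zpool create tank mirror c0t0d0 c0t1d0
# slice vdevs (c0t0d0s4 and the like) leave the cache policy to the admin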
Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question
On Oct 10, 2012, at 9:29 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Richard Elling If the recipient system doesn't support zfs receive, [...] On that note, is there a minimal user-mode zfs thing that would allow receiving a stream into an image file? No need for file/directory access etc. cat :-) He was asking if it's possible to do zfs receive on a system that doesn't natively support zfs. The answer is no, unless you want to consider fuse or similar. Read it again; he asked, On that note, is there a minimal user-mode zfs thing that would allow receiving a stream into an image file? Something like: zfs send ... | ssh user@host cat > file I can't speak about zfs on fuse or anything - except that I personally wouldn't trust it. There are differences even between zfs on solaris versus freebsd, vs whatever, all of which are fully supported, much better than zfs on fuse. But different people use and swear by all of these things - so maybe it would actually be a good solution for you. The direction I would personally go would be an openindiana virtual machine to do the zfs receive. I was thinking maybe the zfs-fuse-on-linux project may have suitable bits? I'm sure most Linux distros have cat hehe. Anyway. Answered above. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Building an On-Site and Off-Size ZFS server, replication question
On Oct 7, 2012, at 3:50 PM, Johannes Totz johan...@jo-t.de wrote: On 05/10/2012 15:01, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Tiernan OToole I am in the process of planning a system which will have 2 ZFS servers, one on site, one off site. The on site server will be used by workstations and servers in house, and most of that will stay in house. There will, however, be data i want backed up somewhere else, which is where the offsite server comes in... This server will be sitting in a Data Center and will have some storage available to it (the whole server currently has 2 3Tb drives, though they are not dedicated to the ZFS box, they are on VMware ESXi). There is then some storage (currently 100Gb, but more can be requested) of SFTP enabled backup which i plan to use for some snapshots, but more on that later. Anyway, i want to confirm my plan and make sure i am not missing anything here...
* build server in house with storage, pools, etc...
* have a server in data center with enough storage for its reason, plus the extra for offsite backup
* have one pool set as my offsite pool... anything in here should be backed up off site also...
* possibly have another set as very offsite which will also be pushed to the SFTP server, but not sure...
* give these pools out via SMB/NFS/iSCSI
* every 6 or so hours take a snapshot of the 2 offsite pools.
* do a ZFS send to the data center box
* nightly, on the very offsite pool, do a ZFS send to the SFTP server
* if anything goes wrong (my server dies, DC server dies, etc), Panic, download, pray... the usual... :)
Anyway, I want to make sure i am doing this correctly... Is there anything on that list that sounds stupid or am i doing anything wrong? am i missing anything? Also, as a follow up question, but slightly unrelated, when it comes to the ZFS Send, i could use SSH to do the send, directly to the machine... Or i could upload the compressed, and possibly encrypted dump to the server... Which, for resume-ability and speed, would be suggested? And if i were to go with an upload option, any suggestions on what i should use? It is recommended, whenever possible, you should pipe the zfs send directly into a zfs receive on the receiving system. For two solid reasons: If a single bit is corrupted, the whole stream checksum is wrong and therefore the whole stream is rejected. So if this occurs, you want to detect it (in the form of one incremental failed) and then correct it (in the form of the next incremental succeeding). Whereas, if you store your streams on storage, it will go undetected, and everything after that point will be broken. If you need to do a restore, from a stream stored on storage, then your only choice is to restore the whole stream. You cannot look inside and just get one file. But if you had been doing send | receive, then you obviously can look inside the receiving filesystem and extract some individual specifics. If the recipient system doesn't support zfs receive, [...] On that note, is there a minimal user-mode zfs thing that would allow receiving a stream into an image file? No need for file/directory access etc. cat :-) I was thinking maybe the zfs-fuse-on-linux project may have suitable bits? I'm sure most Linux distros have cat -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
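So the minimal version of what Johannes is after is just plain shell, e.g. (host and paths are made up):
zfs snapshot tank/data@2012-10-07
zfs send tank/data@2012-10-07 | ssh user@backuphost 'cat > /backup/tank-data-2012-10-07.zfs'
# restore later, subject to the stored-stream caveats above
ssh user@backuphost cat /backup/tank-data-2012-10-07.zfs | zfs receive tank/restored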
Re: [zfs-discuss] How many disk in one pool
On Oct 5, 2012, at 1:57 PM, Albert Shih albert.s...@obspm.fr wrote: Hi all, I'm actually running ZFS under FreeBSD. I've a question about how many disks I «can» have in one pool. At this moment I'm running with one server (FreeBSD 9.0) with 4 MD1200 (Dell) meaning 48 disks. I've configured 4 raidz2 vdevs in the pool (one on each MD1200). From what I understand I can add more MD1200s. But if I lose one MD1200 for any reason I lose the entire pool. In your experience what's the «limit»? 100 disks? I can't speak for current FreeBSD, but I've seen more than 400 disks (HDDs) in a single pool. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
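For reference, growing such a pool a shelf at a time is just adding another top-level vdev, which is also why losing a whole shelf (an entire raidz2 vdev) takes the pool with it. Disk names below are placeholders:
# add a fifth MD1200 as its own raidz2 top-level vdev
zpool add tank raidz2 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 c5t7d0 c5t8d0 c5t9d0 c5t10d0 c5t11d0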
Re: [zfs-discuss] removing upgrade notice from 'zpool status -x'
On Oct 4, 2012, at 8:58 AM, Jan Owoc jso...@gmail.com wrote: Hi, I have a machine whose zpools are at version 28, and I would like to keep them at that version for portability between OSes. I understand that 'zpool status' asks me to upgrade, but so does 'zpool status -x' (the man page says it should only report errors or unavailability). This is a problem because I have a script that assumes zpool status -x only returns errors requiring user intervention. The return code for zpool is ambiguous. Do not rely upon it to determine if the pool is healthy. You should check the health property instead. Is there a way to either: A) suppress the upgrade notice from 'zpool status -x' ? Pedantic answer, it is open source ;-) B) use a different command to get information about actual errors w/out encountering the upgrade notice ? I'm using OpenIndiana 151a6 on x86. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
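For scripts, checking the health property sidesteps the upgrade notice entirely; a sketch (pool name hypothetical):
# prints ONLINE, DEGRADED, FAULTED, etc. -- no version chatter
if [ "$(zpool list -H -o health tank)" != "ONLINE" ]; then
    echo "pool tank needs attention"
fi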
Re: [zfs-discuss] vm server storage mirror
On Oct 4, 2012, at 9:07 AM, Dan Swartzendruber dswa...@druber.com wrote: On 10/4/2012 11:48 AM, Richard Elling wrote: On Oct 4, 2012, at 8:35 AM, Dan Swartzendruber dswa...@druber.com wrote: This whole thread has been fascinating. I really wish we (OI) had the two following things that freebsd supports: 1. HAST - provides a block-level driver that mirrors a local disk to a network disk presenting the result as a block device using the GEOM API. This is called AVS in the Solaris world. In general, these systems suffer from a fatal design flaw: the authoritative view of the data is not also responsible for the replication. In other words, you can provide coherency but not consistency. Both are required to provide a single view of the data. Can you expand on this? I could, but I've already written a book on clustering. For a more general approach to understanding clustering, I can highly recommend Pfister's In Search of Clusters. http://www.amazon.com/In-Search-Clusters-2nd-Edition/dp/0138997098 NB, clustered storage is the same problem as clustered compute wrt state. 2. CARP. This exists as part of the OHAC project. -- richard These are both freely available? Yes. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making ZIL faster
Thanks Neil, we always appreciate your comments on ZIL implementation. One additional comment below... On Oct 4, 2012, at 8:31 AM, Neil Perrin neil.per...@oracle.com wrote: On 10/04/12 05:30, Schweiss, Chip wrote: Thanks for all the input. It seems information on the performance of the ZIL is sparse and scattered. I've spent significant time researching this the past day. I'll summarize what I've found. Please correct me if I'm wrong. The ZIL can have any number of SSDs attached, either mirrored or individually. ZFS will stripe across these in a raid0 or raid10 fashion depending on how you configure them. The ZIL code chains blocks together and these are allocated round robin among slogs or, if they don't exist, then the main pool devices. To determine the true maximum streaming performance of the ZIL, setting sync=disabled will only use the in-RAM ZIL. This gives up power protection for synchronous writes. There is no RAM ZIL. If sync=disabled then all writes are asynchronous and are written as part of the periodic ZFS transaction group (txg) commit that occurs every 5 seconds. Many SSDs do not help protect against power failure because they have their own RAM cache for writes. This effectively makes the SSD useless for this purpose and potentially introduces a false sense of security. (These SSDs are fine for L2ARC.) The ZIL code issues a write cache flush to all devices it has written before returning from the system call. I've heard that not all devices obey the flush, but we consider them broken hardware. I don't have a list to avoid. Mirroring SSDs is only helpful if one SSD fails at the time of a power failure. This leaves several unanswered questions. How good is ZFS at detecting that an SSD is no longer a reliable write target? The chance of silent data corruption is well documented for spinning disks. What chance of data corruption does this introduce with up to 10 seconds of data written on SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is returning what we write to it? If the ZIL code gets a block write failure it will force the txg to commit before returning. It will depend on the drivers and IO subsystem as to how hard it tries to write the block. Zpool versions 19 and higher should be able to survive a ZIL failure, only losing the uncommitted data. However, I haven't seen good enough information that I would necessarily trust this yet. This has been available for quite a while and I haven't heard of any bugs in this area. Several threads seem to suggest a ZIL throughput limit of 1Gb/s with SSDs. I'm not sure if that is current, but I can't find any reports of better performance. I would suspect that DDR drive or Zeus RAM as ZIL would push past this. 1GB/s seems very high, but I don't have any numbers to share. It is not unusual for workloads to exceed the performance of a single device. For example, if you have a device that can achieve 700 MB/sec, but a workload generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it should be immediately obvious that the slog needs to be striped. Empirically, this is also easy to measure. -- richard Anyone care to post their performance numbers on current hardware with E5 processors, and RAM-based ZIL solutions? Thanks to everyone who has responded and contacted me directly on this issue.
-Chip On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel andrew.gabr...@cucumber.demon.co.uk wrote: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Schweiss, Chip How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc. Temporarily set sync=disabled Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload. Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.) This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the Posix semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns. -- Andrew
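The test Edward and Andrew describe is a per-dataset toggle, so production datasets can stay untouched while benchmarking (dataset name hypothetical):
zfs set sync=disabled tank/nfstest    # run the streaming-write benchmark now
zfs set sync=standard tank/nfstest    # restore POSIX sync semantics afterwards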
Re: [zfs-discuss] Making ZIL faster
On Oct 4, 2012, at 1:33 PM, Schweiss, Chip c...@innovates.com wrote: Again thanks for the input and clarifications. I would like to clarify the numbers I was talking about with the ZIL performance specs I was seeing discussed on other forums. Right now I'm getting streaming performance of sync writes at about 1 Gbit/s. My target is closer to 10 Gbit/s. If I get to build this system, it will house a decent size VMware NFS storage w/ 200+ VMs, which will be dual connected via 10GbE. This is all medical imaging research. We move data around by the TB and fast streaming is imperative. The system I've been testing with is 10GbE connected and I have about 50 VMs running very happily, and haven't yet found my random I/O limit. However, every time I storage vMotion a handful of additional VMs, the ZIL seems to max out its writing speed to the SSDs and random I/O also suffers. Without the SSD ZIL, random I/O is very poor. I will be doing some testing with sync=disabled tomorrow and see how things perform. If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE or more of streaming synchronous writes, please let me know. Quick datapoint, with qty 3 ZeusRAMs as striped slog, we could push 1.3 GBytes/sec of storage vmotion on a relatively modest system. To sustain that sort of thing often requires full system-level tuning and proper systems engineering design. Fortunately, people tend to not do storage vmotion on a continuous basis. -- richard -- richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
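A striped slog of the kind Richard describes is just multiple log vdevs; ZFS allocates among them round-robin (device names invented):
# three striped log devices
zpool add tank log c4t0d0 c4t1d0 c4t2d0
# or striped mirrors, if slog redundancy is wanted
zpool add tank log mirror c4t0d0 c5t0d0 mirror c4t1d0 c5t1d0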
[zfs-discuss] reminder: ZFS day next Tuesday
If you've been hiding under a rock, not checking your email, then you might not have heard about the Next Big Whopper Event for ZFS Fans: ZFS Day! The agenda is now set and the teams are preparing to descend towards San Francisco's Moscone Center vortex for a full day of ZFS. I'd love to see y'all there in person, but if you can't make it, be sure to register for the streaming video feeds. Details at: www.zfsday.com Be sure to prep your ZFS war stories for the beer bash afterwards -- thanks Delphix! -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] vm server storage mirror
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: Here's another one. Two identical servers are sitting side by side. They could be connected to each other via anything (presently using a crossover ethernet cable.) And obviously they both connect to the regular LAN. You want to serve VM's from at least one of them, and even if the VM's aren't fault tolerant, you want at least the storage to be live synced. The first obvious thing to do is simply cron a zfs send | zfs receive at a very frequent interval. But there are a lot of downsides to that - besides the fact that you have to settle for some granularity, you also have a script on one system that will clobber the other system. So in the event of a failure, you might promote the backup into production, and you have to be careful not to let it get clobbered when the main server comes up again. I like much better the idea of using a zfs mirror between the two systems. Even if it comes with a performance penalty, as a result of bottlenecking the storage onto Ethernet. But there are several ways to possibly do that, and I'm wondering which will be best. Option 1: Each system creates a big zpool of the local storage. Then, create a zvol within the zpool, and export it via iscsi to the other system. Now both systems can see a local zvol, and a remote zvol, which it can use to create a zpool mirror. The reasons I don't like this idea are because it's a zpool within a zpool, including the double-checksumming and everything. But the double-checksumming isn't such a concern to me - I'm mostly afraid some horrible performance or reliability problem might be resultant. Naturally, you would only zpool import the nested zpool on one system. The other system would basically just ignore it. But in the event of a primary failure, you could force import the nested zpool on the secondary system. This was described by Thorsten a few years ago. http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf IMHO, the issues are operational: troubleshooting could be very challenging. Option 2: At present, both systems are using local mirroring, 3 mirror pairs of 6 disks. I could break these mirrors, and export one side over to the other system... And vice versa. So neither server will be doing local mirroring; they will both be mirroring across iscsi to targets on the other host. Once again, each zpool will only be imported on one host, but in the event of a failure, you could force import it on the other host. Can anybody think of a reason why Option 2 would be stupid, or can you think of a better solution? If they are close enough for crossover cable where the cable is UTP, then they are close enough for SAS. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
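Option 2 in rough command form, for one direction only (COMSTAR on the remote host; GUIDs, addresses and device names are placeholders, so treat this as an outline rather than a recipe):
# host B: export one local disk as an iSCSI LU
svcadm enable stmf
stmfadm create-lu /dev/rdsk/c1t3d0s2
stmfadm add-view 600144f0aabbccdd00000000   # GUID printed by create-lu
itadm create-target
# host A: discover it and mirror against a local disk
iscsiadm add discovery-address 192.168.0.2
iscsiadm modify discovery --sendtargets enable
devfsadm -i iscsi
zpool create vmpool mirror c0t2d0 c0t600144F0AABBCCDD00000000d0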
Re: [zfs-discuss] Interesting question about L2ARC
On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov skiselkov...@gmail.com wrote: On 09/26/2012 01:14 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Jim Klimov Got me wondering: how many reads of a block from spinning rust suffice for it to ultimately get into L2ARC? Just one so it gets into a recent-read list of the ARC and then expires into L2ARC when ARC RAM is more needed for something else, Correct, but not always sufficient. I forget the name of the parameter, but there's some rate limiting thing that limits how fast you can fill the L2ARC. This means sometimes, things will expire from ARC, and simply get discarded. The parameters are: *) l2arc_write_max (default 8MB): max number of bytes written per fill cycle It should be noted that this level was perhaps appropriate 6 years ago, when L2ARC was integrated and given the SSDs available at the time, but is well below reasonable settings for high speed systems or modern SSDs. It is probably not a bad idea to change the default to reflect more modern systems, thus avoiding surprises. -- richard *) l2arc_headroom (default 2x): multiplies the above parameter and determines how far into the ARC lists we will search for buffers eligible for writing to L2ARC. *) l2arc_feed_secs (default 1s): regular interval between fill cycles *) l2arc_feed_min_ms (default 200ms): minimum interval between fill cycles Cheers, -- Saso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
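For anyone wanting to act on Richard's note about modern SSDs, these are the corresponding /etc/system tunables on illumos-derived kernels; the values are illustrative sizing for a fast SSD, not recommendations:

* raise the L2ARC fill throttle (the defaults reflect circa-2006 SSDs)
set zfs:l2arc_write_max = 67108864     * 64 MB per fill cycle, up from 8 MB
set zfs:l2arc_write_boost = 134217728  * faster warm-up while the ARC is cold
set zfs:l2arc_noprefetch = 0           * optionally cache prefetched streams too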
Re: [zfs-discuss] Cold failover of COMSTAR iSCSI targets on shared storage
On Sep 25, 2012, at 12:30 PM, Jim Klimov jimkli...@cos.ru wrote: Hello all, With the original old ZFS iSCSI implementation there was a shareiscsi property for the zvols to be shared out, and I believe all configuration pertinent to the iSCSI server was stored in the pool options (I may be wrong, but I'd expect that given that ZFS-attribute-based configs were designed to atomically import and share pools over various protocols like CIFS and NFS). With COMSTAR, which is more advanced and performant, all configs seem to be in the OS config files and/or SMF service properties - not in the pool in question. Does this mean that importing a pool with iSCSI zvols on a fresh host (LiveCD instance on the same box, or via failover of shared storage to a different host) will not be able to automagically share the iSCSI targets the same way as they were known in the initial OS that created and shared them - not until an admin defines the same LUNs and WWN numbers and such, manually? Is this a correct understanding (and does the problem exist indeed), or do I (hopefully) miss something? That is pretty much how it works, with one small wrinkle -- the configuration is stored in SMF. So you can either do it the hard way (by hand), use a commercially-available HA solution (eg. RSF-1 from high-availability.com), or use SMF export/import. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.
On Sep 25, 2012, at 11:17 AM, Jason Usher jushe...@yahoo.com wrote: Ok - but from a performance point of view, I am only using ram/cpu resources for the deduping of just the individual filesystems I enabled dedupe on, right ? I hope that turning on dedupe for just one filesystem did not incur ram/cpu costs across the entire pool... It depends. -- richard Can you elaborate at all ? Dedupe can have fairly profound performance implications, and I'd like to know if I am paying a huge price just to get a dedupe on one little filesystem ... The short answer is: deduplication transforms big I/Os into small I/Os, but does not eliminate I/O. The reason is that the deduplication table has to be updated when you write something that is deduplicated. This implies that storage devices which are inexpensive in $/GB but expensive in $/IOPS might not be the best candidates for deduplication (eg. HDDs). There is some additional CPU overhead for the sha-256 hash that might or might not be noticeable, depending on your CPU. But perhaps the most important factor is your data -- is it dedupable and are the space savings worthwhile? There is no simple answer for that, but we generally recommend that you simulate dedup before committing to it. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
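On the "simulate dedup" advice: zdb can estimate the outcome on existing data without enabling anything (it walks the pool, so expect it to be slow; run it off-hours). Pool name illustrative:

# print a simulated DDT histogram and projected dedup ratio
zdb -S tank

The "dedup =" line at the bottom of the output is the ratio you would have gotten.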
Re: [zfs-discuss] Cold failover of COMSTAR iSCSI targets on shared storage
On Sep 25, 2012, at 1:32 PM, Jim Klimov jimkli...@cos.ru wrote: 2012-09-26 0:21, Richard Elling wrote: Does this mean that importing a pool with iSCSI zvols on a fresh host (LiveCD instance on the same box, or via failover of shared storage to a different host) will not be able to automagically share the iSCSI targets the same way as they were known in the initial OS that created and shared them - not until an admin defines the same LUNs and WWN numbers and such, manually? Is this a correct understanding (and does the problem exist indeed), or do I (hopefully) miss something? That is pretty much how it works, with one small wrinkle -- the configuration is stored in SMF. So you can either do it the hard way (by hand), use a commercially-available HA solution (eg. RSF-1 from high-availability.com http://high-availability.com), or use SMF export/import. -- richard So if I wanted to make a solution where upon import of the pool with COMSTAR shared zvols, the new host is able to publish the same resources as the previous holder of the pool media, could I get away with some scripts (on all COMSTAR servers involved) which would: 1) Regularly svccfg export certain SMF service configs to a filesystem dataset on the pool in question. This is only needed when you add a new COMSTAR share. You will also need to remove old ones. Fortunately, you have a pool where you can store these :-) 2) Upon import of the pool, such scripts would svccfg import the SMF setup, svcadm refresh and maybe svcadm restart (or svcadm enable) the iSCSI SMF services and thus share the same zvols with same settings? Import should suffice. Is this a correct understanding of doing shareiscsi for COMSTAR in the poor-man's HA setup? ;) Yes. Apparently, to be transparent for clients, this would also use VRRP or something like that to carry over the iSCSI targets' IP address(es), separate from general communications addressing of the hosts (the addressing info might also be in same dataset as SMF exports). Or just add another IP address. This is how HA systems work. Q: Which services are the complete list needed to set up the COMSTAR server from scratch? Dunno off the top of my head. Network isn't needed (COMSTAR can serve FC), but you can look at the SMF configs for details. I haven't looked at the OHAC agents in a long, long time, but you might find some scripts already built there. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
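A minimal sketch of that export/import loop, assuming the config lands in a dataset mounted at /pool/cfg and that the service names match your build (check with svcs stmf iscsi/target):

# on the exporting host, after each share change:
svccfg export -a stmf > /pool/cfg/stmf.xml
svccfg export -a svc:/network/iscsi/target:default > /pool/cfg/iscsit.xml

# on the new host, after zpool import pool:
svccfg import /pool/cfg/stmf.xml
svccfg import /pool/cfg/iscsit.xml
svcadm refresh stmf ; svcadm enable stmf
svcadm enable svc:/network/iscsi/target:default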
Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.
On Sep 25, 2012, at 1:46 PM, Jim Klimov jimkli...@cos.ru wrote: 2012-09-24 21:08, Jason Usher wrote: Ok, thank you. The problem with this is, the compressratio only goes to two significant digits, which means if I do the math, I'm only getting an approximation. Since we may use these numbers to compute billing, it is important to get it right. Is there any way at all to get the real *exact* number ? Well, if you take into account snapshots and clones, you can see really small used numbers on datasets which reference a lot of data. In fact, for accounting you might be better off with the referenced field instead of used, but note that it is not recursive and you need to account each child dataset's byte references separately. I am not sure if there is a simple way to get exact byte-counts instead of roundings like 422M... zfs get -p -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
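For reference, the -p form prints raw byte counts rather than the rounded human-readable numbers (dataset name illustrative):

# exact bytes, suitable for billing math
zfs get -Hp used,referenced rootpool/export/home

Note that the ratio properties (compressratio, dedupratio) still come back rounded to two decimal places even with -p, which is why the dtrace approach in the next message exists.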
Re: [zfs-discuss] ZFS stats output - used, compressed, deduped, etc.
On Sep 24, 2012, at 10:08 AM, Jason Usher jushe...@yahoo.com wrote: Oh, and one other thing ... --- On Fri, 9/21/12, Jason Usher jushe...@yahoo.com wrote: It shows the allocated number of bytes used by the filesystem, i.e. after compression. To get the uncompressed size, multiply used by compressratio (so for example if used=65G and compressratio=2.00x, then your decompressed size is 2.00 x 65G = 130G). Ok, thank you. The problem with this is, the compressratio only goes to two significant digits, which means if I do the math, I'm only getting an approximation. Since we may use these numbers to compute billing, it is important to get it right. Is there any way at all to get the real *exact* number ? I'm hoping the answer is yes - I've been looking but do not see it ... none can hide from dtrace!

# dtrace -qn 'dsl_dataset_stats:entry { this->ds = (dsl_dataset_t *)arg0; printf("%s\tcompressed size = %d\tuncompressed size = %d\n", this->ds->ds_dir->dd_myname, this->ds->ds_phys->ds_compressed_bytes, this->ds->ds_phys->ds_uncompressed_bytes) }'
openindiana-1   compressed size = 3667988992    uncompressed size = 3759321088

[zfs get all rpool/openindiana-1 in another shell] For reporting, the number is rounded to 2 decimal places. Ok. So the dedupratio I see for the entire pool is dedupe ratio for filesystems in this pool that have dedupe enabled ... yes ? Also, why do I not see any dedupe stats for the individual filesystem ? I see compressratio, and I see dedup=on, but I don't see any dedupratio for the filesystem itself... Ok, getting back to precise accounting ... if I turn on dedupe for a particular filesystem, and then I multiply the used property by the compressratio property, and calculate the real usage, do I need to do another calculation to account for the deduplication ? Or does the used property not take into account deduping ? So if the answer to this is yes, the used property is not only a compressed figure, but a deduped figure then I think we have a bigger problem ... You described dedupe as operating not only within the filesystem with dedup=on, but between all filesystems with dedupe enabled. Doesn't that mean that if I enabled dedupe on more than one filesystem, I can never know how much total, raw space each of those is using ? Because if the dedupe ratio is calculated across all of them, it's not the actual ratio for any one of them ... so even if I do the math, I can't decide what the total raw usage for one of them is ... right ? Correct. This is by design so that blocks shared amongst different datasets can be deduped -- the common case for things like virtual machine images. Again, if used does not reflect dedupe, and I don't need to do any math to get the raw storage figure, then it doesn't matter... Did turning on dedupe for a single filesystem turn it on for the entire pool ? In a sense, yes. The dedup machinery is pool-wide, but only writes from filesystems which have dedup enabled enter it. The rest simply pass it by and work as usual. Ok - but from a performance point of view, I am only using ram/cpu resources for the deduping of just the individual filesystems I enabled dedupe on, right ? I hope that turning on dedupe for just one filesystem did not incur ram/cpu costs across the entire pool... I also wonder about this performance question... It depends.
-- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about ZFS snapshots
On Sep 20, 2012, at 10:05 PM, Stefan Ring stefan...@gmail.com wrote: On Fri, Sep 21, 2012 at 6:31 AM, andy thomas a...@time-domain.co.uk wrote: I have a ZFS filesystem and create weekly snapshots over a period of 5 weeks called week01, week02, week03, week04 and week05 respectively. My question is: how do the snapshots relate to each other - does week03 contain the changes made since week02 or does it contain all the changes made since the first snapshot, week01, and therefore includes those in week02? Every snapshot is based on the previous one and stores only what is needed to capture the differences. This is not correct. Every snapshot is a complete point-in-time view of the dataset. You can send differences between snapshots that can be received, thus satisfying a requirement for incremental replication. Internally, this is easy to do because the birth order (in time) of a block is recorded in the metadata. To rollback to week03, it's necessary to delete snapshots week04 and week05 first but what if week01 and week02 have also been deleted - will the rollback still work or is it necessary to keep earlier snapshots? No, it's not necessary. You can rollback to any snapshot. I almost never use rollback though, in normal use. If I've accidentally deleted or overwritten something, I just rsync it over from the corresponding /.zfs/snapshot directory. Only if what I want to restore is huge, rollback might be a better option. Yes, rollback is not used very frequently. It is more common to copy out or clone the older snapshot. For example, you can clone week03, creating what is essentially a fork. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Francisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
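Both restore paths are one-liners (dataset and file names illustrative):

# pull a single file back out of a snapshot, no rollback needed
cp /tank/data/.zfs/snapshot/week03/report.txt /tank/data/

# or fork the dataset as of week03, as Richard describes
zfs clone tank/data@week03 tank/data-week03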
Re: [zfs-discuss] Selective zfs list
Hi Bogdan, On Sep 21, 2012, at 4:00 AM, Bogdan Ćulibrk b...@default.rs wrote: Greetings, I'm trying to achieve selective output of zfs list command for specific user to show only delegated sets. Anyone knows how to achieve this? There are several ways, but no builtin way, today. Can you provide a use case for how you want this to work? We might want to create an RFE here :-) -- richard I've checked zfs allow already but it only helps in restricting the user to create, destroy, etc something. There is no permission subcommand for listing or displaying sets. I'm on oi_151a3 bits. -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
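Pending such an RFE, a common stopgap is a wrapper that lists only the delegated subtree; this assumes a naming convention (one dataset per user under tank/delegated), which is the part zfs cannot enforce today:

#!/bin/sh
# show only the invoking user's delegated datasets
exec zfs list -r -o name,used,avail,refer "tank/delegated/$USER"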
Re: [zfs-discuss] all in one server
On Sep 18, 2012, at 7:31 AM, Eugen Leitl eu...@leitl.org wrote: Can I actually have a year's worth of snapshots in zfs without too much performance degradation? I've got 6 years of snapshots with no degradation :-) In general, there is not a direct correlation between snapshot count and performance. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zvol vs zfs send/zfs receive
On Sep 15, 2012, at 6:03 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Sat, 15 Sep 2012, Dave Pooser wrote: The problem: so far the send/recv appears to have copied 6.25TB of 5.34TB. That... doesn't look right. (Comparing zfs list -t snapshot and looking at the 5.34 ref for the snapshot vs zfs list on the new system and looking at space used.) Is this a problem? Should I be panicking yet? Does the old pool use 512 byte sectors while the new pool uses 4K sectors? Is there any change to compression settings? With volblocksize of 8k on disks with 4K sectors one might expect very poor space utilization because metadata chunks will use/waste a minimum of 4k. There might be more space consumed by the metadata than the actual data. With a zvol of 8K blocksize, 4K sector disks, and raidz you will get 12K (data plus parity) written for every block, regardless of how many disks are in the set. There will also be some metadata overhead, but I don't know of a metadata sizing formula for the general case. So the bad news is, 4K sector disks with small blocksize zvols tend to have space utilization more like mirroring. The good news is that performance is also more like mirroring. -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
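The arithmetic behind that claim, as a hedged back-of-envelope: with volblocksize=8k on 4K-sector raidz, each logical block becomes two data sectors plus one parity sector, i.e. 12K on disk for 8K of data, so:

# worst-case on-disk size for a 5.34T zvol, before metadata overhead
echo '5.34 * 12 / 8' | bc -l    # => 8.01 (TB)

which is consistent with a received copy that keeps growing past the referenced size.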
Re: [zfs-discuss] scripting incremental replication data streams
On Sep 12, 2012, at 12:44 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: I send a replication data stream from one host to another. (and receive). I discovered that after receiving, I need to remove the auto-snapshot property on the receiving side, and set the readonly property on the receiving side, to prevent accidental changes (including auto-snapshots.) Question #1: Actually, do I need to remove the auto-snapshot on the receiving side? Yes Or is it sufficient to simply set the readonly property? No Will the readonly property prevent auto-snapshots from occurring? No So then, sometime later, I want to send an incremental replication stream. I need to name an incremental source snap on the sending side... which needs to be the latest matching snap that exists on both sides. Question #2: What's the best way to find the latest matching snap on both the source and destination? At present, it seems, I'll have to build a list of sender snaps, and a list of receiver snaps, and parse and search them, till I find the latest one that exists in both. For shell scripting, this is very non-trivial. Actually, it is quite easy. You will notice that zfs list -t snapshot shows the list in creation time order. If you are more paranoid, you can get the snapshot's creation time from the creation property. For convenience, zfs get -p creation ... will return the time as a number. Something like this: for i in $(zfs list -t snapshot -H -o name); do echo $(zfs get -p -H -o value creation $i) $i; done | sort -n -- richard -- illumos Day ZFS Day, Oct 1-2, 2012 San Fransisco www.zfsday.com richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
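Extending that one-liner into the "latest matching snap" lookup the poster asked about, a sketch with placeholder dataset and host names:

#!/bin/sh
SRC=tank/data ; DST=backup/data ; RHOST=backuphost
# snapshot names on each side, oldest to newest
zfs list -H -o name -t snapshot -s creation -d 1 "$SRC" | sed 's/.*@//' > /tmp/src.$$
ssh "$RHOST" zfs list -H -o name -t snapshot -s creation -d 1 "$DST" | sed 's/.*@//' > /tmp/dst.$$
# newest name present in both lists
COMMON=$(grep -Fx -f /tmp/dst.$$ /tmp/src.$$ | tail -1)
echo "latest common snapshot: @$COMMON"
rm -f /tmp/src.$$ /tmp/dst.$$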
Re: [zfs-discuss] ZFS snapshot used space question
For illumos-based distributions, there is a written and written@ property that shows the amount of data written to each snapshot. This helps to clear the confusion over the way the used property is accounted. https://www.illumos.org/issues/1645 -- richard On Aug 29, 2012, at 11:12 AM, Truhn, Chad chad.tr...@bowheadsupport.com wrote: All, I apologize in advance for what appears to be a question asked quite often, but I am not sure I have ever seen an answer that explains it. This may also be a bit long-winded so I apologize for that as well. I would like to know how much unique space each individual snapshot is using. I have a ZFS filesystem that shows:

$ zfs list -o space rootpool/export/home
NAME                  AVAIL  USED   USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rootpool/export/home  5.81G  14.4G  8.81G     5.54G   0              0

So reading this I see that I have a total of 14.4G of space used by this data set. Currently 5.54G is active data that is available on the normal filesystem and 8.81G is used in snapshots. 8.81G + 5.54G = 14.4G (roughly). I 100% agree with these numbers and the world makes sense. This is also backed up by:

$ zfs get usedbysnapshots rootpool/export/home
NAME                  PROPERTY         VALUE  SOURCE
rootpool/export/home  usedbysnapshots  8.81G  -

Now if I wanted to see how much space any individual snapshot is currently using I would like to think that this would show me:

$ zfs list -ro space rootpool/export/home
NAME                            AVAIL  USED   USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
rootpool/export/home            5.81G  14.4G  8.81G     5.54G   0              0
rootpool/export/home@week3      -      202M   -         -       -              -
rootpool/export/home@week2      -      104M   -         -       -              -
rootpool/export/home@7daysago   -      1.37M  -         -       -              -
rootpool/export/home@6daysago   -      1.20M  -         -       -              -
rootpool/export/home@5daysago   -      1020K  -         -       -              -
rootpool/export/home@4daysago   -      342K   -         -       -              -
rootpool/export/home@3daysago   -      1.28M  -         -       -              -
rootpool/export/home@week1      -      0      -         -       -              -
rootpool/export/home@2daysago   -      0      -         -       -              -
rootpool/export/home@yesterday  -      360K   -         -       -              -
rootpool/export/home@today      -      1.26M  -         -       -              -

So normal logic would tell me if USEDSNAP is 8.81G and is composed of 11 snapshots, I would add up the size of each of those snapshots and that would roughly equal 8.81G. So time to break out the calculator: 202M + 104M + 1.37M + 1.20M + 1020K + 342K + 1.28M + 0 + 0 + 360K + 1.26M equals... ~312M! That is nowhere near 8.81G. I would accept it even if it was within 15%, but it's not even close. That's definitely not metadata or ZFS overhead or anything. I understand that snapshots are just the delta between the time when the snapshot was taken and the current active filesystem and are truly just references to a block on disk rather than a copy. I also understand how two (or more) snapshots can reference the same block on a disk but yet there is still only that one block used. If I delete a recent snapshot I may not save as much space as advertised because some may be inherited by a parent snapshot. But that inheritance is not creating duplicate used space on disk so it doesn't justify the huge difference in sizes. But even with this logic in place there is currently 8.81G of blocks referred to by snapshots which are not currently on the active filesystem and I don't believe anyone can argue with that. Can something show me how much space a single snapshot has reserved? I searched through some of the archives and found this thread (http://mail.opensolaris.org/pipermail/zfs-discuss/2012-August/052163.html) from early this month and I feel as if I have the same problem as the OP, but hopefully attacking it with a little more background.
I am not arguing with discrepancies between df/du and zfs output and I have read the Oracle documentation about it but haven't found what I feel like should be a simple answer. I currently have a ticket open with Oracle, but I am getting answers to all kinds of questions except for the question I am asking so I am hoping someone out there might be able to help me. I am a little concerned I am going
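Where the written properties mentioned at the top are available (illumos builds with issue 1645 integrated), they answer the per-snapshot question directly:

# per-snapshot delta, alongside the used column that caused the confusion
zfs list -r -t snapshot -o name,used,written rootpool/export/home

# or: bytes written to the dataset since a given snapshot
zfs get written@week2 rootpool/export/home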
Re: [zfs-discuss] Dedicated metadata devices
On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote: This is something I've been looking into in the code and my take on your proposed points is this: 1) This requires many and deep changes across much of ZFS's architecture (especially the ability to sustain tlvdev failures). 2) Most of this can be achieved (except for cache persistency) by implementing ARC space reservations for certain types of data. I think the simple solution of increasing the default metadata limit above 1/4 of arc_max will take care of the vast majority of small system complaints. The limit is arbitrary and was set well before dedupe was delivered. The latter has the added benefit of spreading load across all ARC and L2ARC resources, so your metaxel device never becomes the sole bottleneck and it better embraces the ZFS design philosophy of pooled storage. I plan on having a look at implementing cache management policies (which would allow for tuning space reservations for metadata/etc. in a fine-grained manner without the cruft of having to worry about physical cache devices as well). Cheers, -- Saso On 08/24/2012 03:39 PM, Jim Klimov wrote: Hello all, The idea of dedicated metadata devices (likely SSDs) for ZFS has been generically discussed a number of times on this list, but I don't think I've seen a final proposal that someone would take up for implementation (as a public source code, at least). I'd like to take the liberty of summarizing the ideas I've either seen in discussions or proposed myself on this matter, to see if the overall idea would make sense to gurus of ZFS architecture. So, the assumption was that the performance killer in ZFS at least on smallish deployments (few HDDs and an SSD accelerator), like those in Home-NAS types of boxes, was random IO to lots of metadata. It is a bad idea to make massive investments in development and testing because of an assumption. Build test cases, prove that the benefits of the investment can outweigh other alternatives, and then deliver code. -- richard This IMHO includes primarily the block pointer tree and the DDT for those who risked using dedup. I am not sure how frequent is the required read access to other types of metadata (like dataset descriptors, etc.) that the occasional reading and caching won't solve. Another idea was that L2ARC caching might not really cut it for metadata in comparison to a dedicated metadata storage, partly due to the L2ARC becoming empty upon every export/import (boot) and needing to get re-heated. So, here go the highlights of the proposal (up for discussion). In short, the idea is to use today's format of the blkptr_t which by default allows to store up to 3 DVA addresses of the block, and many types of metadata use only 2 copies (at least by default). This new feature adds a specially processed TLVDEV in the common DVA address space of the pool, and enforces storage of added third copies for certain types of metadata blocks on these devices. (Limited) Backwards compatibility is quite possible, on-disk format change may be not required. The proposal also addresses some questions that arose in previous discussions, especially about proposals where SSDs would be the only storage for pool's metadata: * What if the dedicated metadata device overflows? * What if the dedicated metadata device breaks? = okay/expected by design, nothing dies.
In more detail: 1) Add a special Top-Level VDEV (TLVDEV below) device type (like cache and log - say, metaxel for metadata accelerator?), and allow (even encourage) use of mirrored devices and allow expansion (raid0, raid10 and/or separate TLVDEVs) with added singlets/mirrors of such devices. Method of device type definition for the pool is discussable, I'd go with a special attribute (array) or nvlist in the pool descriptor, rather than some special type ID in the ZFS label (backwards compatibility, see point 4 for detailed rationale). Discussable: enable pool-wide or per-dataset (i.e. don't waste accelerator space and lifetime for rarely-reused datasets like rolling backups)? Choose what to store on (particular) metaxels - DDT, BPTree, something else? Overall, this availability of choice is similar to choice of modes for ARC/L2ARC caching or enabling ZIL per-dataset... 2) These devices should be formally addressable as part of the pool in DVA terms (tlvdev:offset:size), but writes onto them are artificially limited by ZFS scheduler so as to only allow specific types of metadata blocks (blkptr_t's, DDT entries), and also enforce writing of added third copies (for blocks of metadata with usual copies=2) onto these devices. 3) Absence or FAULTEDness of this device should not be fatal to the pool, but it may require manual intervention to force the import. Particularly, removal, replacement or resilvering onto different storage (i.e. migrating to larger SSDs) should
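For scale, the metadata limit mentioned in the reply above is already tunable today without any of this machinery; an illustrative /etc/system entry for a large-memory box (value is an example, not a recommendation):

* allow ARC metadata (incl. DDT) to use up to 24 GB instead of 1/4 of arc_max
set zfs:zfs_arc_meta_limit = 0x600000000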
Re: [zfs-discuss] Recovering lost labels on raidz member
On Aug 13, 2012, at 2:24 AM, Sašo Kiselkov wrote: On 08/13/2012 10:45 AM, Scott wrote: Hi Saso, thanks for your reply. If all disks are the same, is the root pointer the same? No. Also, is there a signature or something unique to the root block that I can search for on the disk? I'm going through the On-disk specification at the moment. Nope. The checksums are part of the blockpointer, and the root blockpointer is in the uberblock, which itself resides in the label. By overwriting the label you've essentially erased all hope of practically finding the root of the filesystem tree - not even checksumming all possible block combinations (of which there are quite a few) will help you here, because you have no checksums to compare them against. I'd love to be wrong, and I might be (I don't have as intimate a knowledge of ZFS' on-disk structure as I'd like), but from where I'm standing, your raidz vdev is essentially lost. The labels are not identical, because each contains the guid for the device. It is possible, though nontrivial, to recreate. That said, I've never seen a failure that just takes out only the ZFS labels. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering lost labels on raidz member
On Aug 13, 2012, at 8:59 PM, Scott wrote: On Mon, Aug 13, 2012 at 10:40:45AM -0700, Richard Elling wrote: On Aug 13, 2012, at 2:24 AM, Sašo Kiselkov wrote: On 08/13/2012 10:45 AM, Scott wrote: Hi Saso, thanks for your reply. If all disks are the same, is the root pointer the same? No. Also, is there a signature or something unique to the root block that I can search for on the disk? I'm going through the On-disk specification at the moment. Nope. The checksums are part of the blockpointer, and the root blockpointer is in the uberblock, which itself resides in the label. By overwriting the label you've essentially erased all hope of practically finding the root of the filesystem tree - not even checksumming all possible block combinations (of which there are quite a few) will help you here, because you have no checksums to compare them against. I'd love to be wrong, and I might be (I don't have as intimate a knowledge of ZFS' on-disk structure as I'd like), but from where I'm standing, your raidz vdev is essentially lost. The labels are not identical, because each contains the guid for the device. It is possible, though nontrivial, to recreate. That said, I've never seen a failure that just takes out only the ZFS labels. You'd have to go out of your way to take out the labels. Which is just what I did (imagine: moving drives over to USB external enclosures, then putting them onto a HP Raid controller (which overwrites the end of the disk) - which also assumed that two disks should be automatically mirrored (if you miss the 5 second prompt where you can tell it not to). ouch. But that shouldn't be enough. Then try and recover the labels without really knowing what you're doing (my bad). d'oh! Suffice to say I have no confidence in the labels of two drives. On OI I can forcefully import the pool but with any file that lives on multiple disks (ie, over a certain size), all I get is an I/O error. Some of datasets also fail to mount. please tell me you imported readonly? -- richard Thanks everyone for your input. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
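For anyone following along, the readonly import being hoped for looks like this (pool name illustrative); it keeps the damaged labels from being compounded by new writes while data is salvaged:

zpool import -o readonly=on -f tank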
Re: [zfs-discuss] FreeBSD ZFS
On Aug 9, 2012, at 4:11 AM, joerg.schill...@fokus.fraunhofer.de (Joerg Schilling) wrote: Sašo Kiselkov skiselkov...@gmail.com wrote: On 08/09/2012 01:05 PM, Joerg Schilling wrote: Sašo Kiselkov skiselkov...@gmail.com wrote: To me it seems that the open-sourced ZFS community is not open, or could you point me to their mailing list archives? Jörg z...@lists.illumos.org Well, why then has there been a discussion about a closed zfs mailing list? Is this no longer true? Not that I know of. The above one is where I post my changes and Matt, George, Garrett and all the others are lurking there. So if you frequently read this list, can you tell me whether they discuss the on-disk format in this list? Yes, but nobody has posted proposals for new on-disk format changes since feature flags was first announced. NB, the z...@lists.illumos.org is but one of the many discuss groups where ZFS users can get questions answered. There is also active Mac OSX, ZFS on Linux, and OTN lists. IMHO, zfs-discuss@opensolaris is shrinking, not growing. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On Aug 2, 2012, at 5:40 PM, Nigel W wrote: On Thu, Aug 2, 2012 at 3:39 PM, Richard Elling richard.ell...@gmail.com wrote: On Aug 1, 2012, at 8:30 AM, Nigel W wrote: Yes. +1 The L2ARC as is it currently implemented is not terribly useful for storing the DDT in anyway because each DDT entry is 376 bytes but the L2ARC reference is 176 bytes, so best case you get just over double the DDT entries in the L2ARC as what you would get into the ARC but then you have also have no ARC left for anything else :(. You are making the assumption that each DDT table entry consumes one metadata update. This is not the case. The DDT is implemented as an AVL tree. As per other metadata in ZFS, the data is compressed. So you cannot make a direct correlation between the DDT entry size and the effect on the stored metadata on disk sectors. -- richard It's compressed even when in the ARC? That is a slightly odd question. The ARC contains ZFS blocks. DDT metadata is manipulated in memory as an AVL tree, so what you can see in the ARC is the metadata blocks that were read and uncompressed from the pool or packaged in blocks and written to the pool. Perhaps it is easier to think of them as metadata in transition? :-) -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On Aug 1, 2012, at 2:41 PM, Peter Jeremy wrote: On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote: I think a fantastic idea for dealing with the DDT (and all other metadata for that matter) would be an option to put (a copy of) metadata exclusively on a SSD. This is on my wishlist as well. I believe ZEVO supports it so possibly it'll be available in ZFS in the near future. ZEVO does not. The only ZFS vendor I'm aware of with a separate top-level vdev for metadata is Tegile, and it is available today. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On Aug 1, 2012, at 8:30 AM, Nigel W wrote: On Wed, Aug 1, 2012 at 8:33 AM, Sašo Kiselkov skiselkov...@gmail.com wrote: On 08/01/2012 04:14 PM, Jim Klimov wrote: chances are that some blocks of userdata might be more popular than a DDT block and would push it out of L2ARC as well... Which is why I plan on investigating implementing some tunable policy module that would allow the administrator to get around this problem. E.g. administrator dedicates 50G of ARC space to metadata (which includes the DDT) or only the DDT specifically. My idea is still a bit fuzzy, but it revolves primarily around allocating and policing min and max quotas for a given ARC entry type. I'll start a separate discussion thread for this later on once I have everything organized in my mind about where I plan on taking this. Yes. +1 The L2ARC as is it currently implemented is not terribly useful for storing the DDT in anyway because each DDT entry is 376 bytes but the L2ARC reference is 176 bytes, so best case you get just over double the DDT entries in the L2ARC as what you would get into the ARC but then you have also have no ARC left for anything else :(. You are making the assumption that each DDT table entry consumes one metadata update. This is not the case. The DDT is implemented as an AVL tree. As per other metadata in ZFS, the data is compressed. So you cannot make a direct correlation between the DDT entry size and the effect on the stored metadata on disk sectors. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
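To put numbers on this for a real pool, zdb will print DDT entry counts plus on-disk and in-core sizes, which already fold in the AVL packing and compression described above (pool name illustrative):

# DDT statistics and histogram
zdb -DD tank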
Re: [zfs-discuss] unable to import the zpool
On Aug 1, 2012, at 12:21 AM, Suresh Kumar wrote: Dear ZFS-Users, I am using Solaris x86 10u10. All the devices which belong to my zpool are in the available state, but I am unable to import the zpool.

# zpool import tXstpool
cannot import 'tXstpool': one or more devices is currently unavailable

==

bash-3.2# zpool import
  pool: tXstpool
    id: 13623426894836622462
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        tXstpool                     UNAVAIL  missing device
          mirror-0                   DEGRADED
            c2t210100E08BB2FC85d0s0  FAULTED  corrupted data
            c2t21E08B92FC85d2        ONLINE

Additional devices are known to be part of this pool, though their exact configuration cannot be determined. This message is your clue. The pool is missing a device. In most of the cases where I've seen this, it occurs on older ZFS implementations and the missing device is an auxiliary device: cache or spare. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
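A quick way to see which device the pool remembers but cannot find is to dump the labels from a device that is present and inspect the vdev tree recorded there (device path illustrative):

# print the four ZFS labels, including the cached children of this top-level vdev
zdb -l /dev/rdsk/c2t21E08B92FC85d2s0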
Re: [zfs-discuss] encfs on top of zfs
On Jul 31, 2012, at 8:05 PM, opensolarisisdeadlongliveopensolaris wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Richard Elling I believe what you meant to say was dedup with HDDs sux. If you had used fast SSDs instead of HDDs, you will find dedup to be quite fast. -- richard Yes, but this is a linear scale. No, it is definitely NOT a linear scale. Study Amdahl's law a little more carefully. Suppose an SSD without dedup is 100x faster than a HDD without dedup. And suppose dedup slows down a system by a factor of 10x. Now your SSD with dedup is only 10x faster than the HDD without dedup. So quite fast is a relative term. Of course it is. The SSD with dedup is still faster than the HDD without dedup, but it's also slower than the SSD without dedup. duh. With dedup you are trading IOPS for space. In general, HDDs have lots of space and terrible IOPS. SSDs have less space, but more IOPS. Obviously, as you point out, the best solution is lots of space and lots of IOPS. The extent of fibbing I'm doing is thusly: In reality, an SSD is about equally fast with HDD for sequential operations, and about 100x faster for random IO. It just so happens that the dedup performance hit is almost purely random IO, so it's right in the sweet spot of what SSD's handle well. In the vast majority of modern systems, there are no sequential I/O workloads. That is a myth propagated by people who still think HDDs can be fast. You can't use an overly simplified linear model like I described above - In reality, there's a grain of truth in what Richard said, and also a grain of truth in what I said. The real truth is somewhere in between what he said and what I said. But closer to my truth :-) No, the SSD will not perform as well with dedup as it does without dedup. But the suppose dedup slows down by 10x that I described above is not accurate. Depending on what you're doing, dedup might slow down an HDD by 20x, and it might only slow down SSD by 4x doing the same work load. Highly variable, and highly dependent on the specifics of your workload. You are making the assumption that the system is not bandwidth limited. This is a good assumption for the HDD case, because the media bandwidth is much less than the interconnect bandwidth. For SSDs, this assumption is not necessarily true. There are SSDs that are bandwidth constrained on the interconnect, and in those cases, your model fails. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pool Unavailable
On Aug 1, 2012, at 8:04 AM, Jesse Jamez wrote: Hello, I recently rebooted my workstation and the disk names changed, causing my ZFS pool to be unavailable. What OS and release? I did not make any hardware changes. My first question is the obvious: Did I lose my data? Can I recover it? Yes, just import the pool. What would cause the names to change? Delay in the order that the HBA brought them up? It depends on your OS and OBP (or BIOS). How can I correct this problem going forward? The currently imported pool configurations are recorded in the /etc/zfs/zpool.cache file for Solaris-like OSes. At boot time, the system will try to import the pools in the cache. If the cache contents no longer match reality for non-root pools, then the safest action is to not automatically import the pool. An error message is displayed and should point to a website that tells you how to correct this (NB, depending on the OS, that URL may or may not exist at Oracle (nee Sun)) -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
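In practice the fix is usually a one-time manual import, which also refreshes the cached device paths (pool name illustrative):

# rescan devices and import under their new names
zpool import mypool
# or, if the disks now live in a non-default device directory:
zpool import -d /dev/dsk mypool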
Re: [zfs-discuss] encfs on top of zfs
On Jul 31, 2012, at 10:07 AM, Nigel W wrote: On Tue, Jul 31, 2012 at 9:36 AM, Ray Arachelian r...@arachelian.com wrote: On 07/31/2012 09:46 AM, opensolarisisdeadlongliveopensolaris wrote: Dedup: First of all, I don't recommend using dedup under any circumstance. Not that it's unstable or anything, just that the performance is so horrible, it's never worth while. But particularly with encrypted data, you're guaranteed to have no duplicate data anyway, so it would be a pure waste. Don't do it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss One thing you can do is enable dedup when you copy all your data from one zpool to another, then, when you're done, disable dedup. It will no longer waste a ton of memory, and your new volume will have a high dedup ratio. (Obviously anything you add after you turn dedup off won't be deduped.) You can keep the old pool as a backup, or wipe it or whatever and later on do the same operation in the other direction. Once something is written deduped you will always use the memory when you want to read any files that were written when dedup was enabled, so you do not save any memory unless you do not normally access most of your data. Also don't let the system crash :D or try to delete too much from the deduped dataset :D (including snapshots or the dataset itself) because then you have to reload all (most) of the DDT in order to delete the files. This gets a lot of people in trouble (including me at $work :|) because you need to have the RAM available at all times to load most of the DDT (75%, to grab a number out of the air) in case the server crashes. Otherwise you are stuck with a machine trying to verify its filesystem for hours. I have one test system that has 4 GB of RAM and 2 TB of deduped data; when it crashes (panic, power failure, etc.) it would take 8-12 hours to boot up again. It now has 1TB of data and will boot in about 5 minutes or so. I believe what you meant to say was dedup with HDDs sux. If you had used fast SSDs instead of HDDs, you will find dedup to be quite fast. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL devices and fragmentation
On Jul 30, 2012, at 10:20 AM, Roy Sigurd Karlsbakk wrote: - Original message - On Mon, Jul 30, 2012 at 9:38 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: Also keep in mind that if you have an SLOG (ZIL on a separate device), and then lose this SLOG (disk crash etc), you will probably lose the pool. So if you want/need SLOG, you probably want two of them in a mirror… That's only true on older versions of ZFS. ZFSv19 (or 20?) includes the ability to import a pool with a failed/missing log device. You lose any data that is in the log and not in the pool, but the pool is importable. Are you sure? I booted this v28 pool a couple of months back, and found it didn't recognize its pool, apparently because of a missing SLOG. It turned out the cache shelf was disconnected, after re-connecting it, things worked as planned. I didn't try to force a new import, though, but it didn't boot up normally, and told me it couldn't import its pool due to lack of SLOG devices. Positive. :) I tested it with ZFSv28 on FreeBSD 9-STABLE a month or two ago. See the updated man page for zpool, especially the bit about import -m. :) On 151a2, man page just says 'use this or that mountpoint' with import -m, but the fact was zpool refused to import the pool at boot when 2 SLOG devices (mirrored) and 10 L2ARC devices were offline. Should OI/Illumos be able to boot cleanly without manual action with the SLOG devices gone? No. Missing slogs is a potential data-loss condition. Importing the pool without slogs requires acceptance of the data-loss -- human interaction. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
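The interaction under discussion, spelled out (names illustrative); -m is precisely the human acceptance of slog data loss:

# import despite missing log devices, discarding uncommitted slog data
zpool import -m tank
# then drop or replace the dead log vdev
zpool remove tank c9t0d0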
Re: [zfs-discuss] ZIL devices and fragmentation
On Jul 30, 2012, at 12:25 PM, Tim Cook wrote: On Mon, Jul 30, 2012 at 12:44 PM, Richard Elling richard.ell...@gmail.com wrote: On Jul 30, 2012, at 10:20 AM, Roy Sigurd Karlsbakk wrote: - Original message - On Mon, Jul 30, 2012 at 9:38 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: Also keep in mind that if you have an SLOG (ZIL on a separate device), and then lose this SLOG (disk crash etc), you will probably lose the pool. So if you want/need SLOG, you probably want two of them in a mirror… That's only true on older versions of ZFS. ZFSv19 (or 20?) includes the ability to import a pool with a failed/missing log device. You lose any data that is in the log and not in the pool, but the pool is importable. Are you sure? I booted this v28 pool a couple of months back, and found it didn't recognize its pool, apparently because of a missing SLOG. It turned out the cache shelf was disconnected, after re-connecting it, things worked as planned. I didn't try to force a new import, though, but it didn't boot up normally, and told me it couldn't import its pool due to lack of SLOG devices. Positive. :) I tested it with ZFSv28 on FreeBSD 9-STABLE a month or two ago. See the updated man page for zpool, especially the bit about import -m. :) On 151a2, man page just says 'use this or that mountpoint' with import -m, but the fact was zpool refused to import the pool at boot when 2 SLOG devices (mirrored) and 10 L2ARC devices were offline. Should OI/Illumos be able to boot cleanly without manual action with the SLOG devices gone? No. Missing slogs is a potential data-loss condition. Importing the pool without slogs requires acceptance of the data-loss -- human interaction. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 I would think a flag to allow you to automatically continue with a disclaimer might be warranted (default behavior obviously requiring human input). Disagree, the appropriate action is to boot as far as possible. The pool will not be imported and will have the normal fault management alerts generated. For interactive use, the import will fail, and you can add the -m option. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL devices and fragmentation
On Jul 29, 2012, at 7:07 AM, Jim Klimov wrote: Hello, list For several times now I've seen statements on this list implying that a dedicated ZIL/SLOG device catching sync writes for the log, also allows for more streamlined writes to the pool during normal healthy TXG syncs, than is the case with the default ZIL located within the pool. I'm not sure where you are heading here. Space for the data in the pool is allocated based on the policies of the pool. Is this understanding correct? Does it apply to any generic writes, or only to sync-heavy scenarios like databases or NFS servers? Async writes don't use the ZIL. -- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL devices and fragmentation
On Jul 29, 2012, at 1:53 PM, Jim Klimov wrote: 2012-07-30 0:40, opensolarisisdeadlongliveopensolaris wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Jim Klimov For several times now I've seen statements on this list implying that a dedicated ZIL/SLOG device catching sync writes for the log, also allows for more streamlined writes to the pool during normal healthy TXG syncs, than is the case with the default ZIL located within the pool. It might just be more clear, if it's stated differently: At any given time, your pool is in one of four states: idle, reading, writing, or idle with writes queued but not currently being written. Now a sync write operation takes place. If you have a dedicated log, it goes directly to the log, and it doesn't interfere with any of the other operations that might be occurring right now. You don't have to interrupt your current activity, simply, your sync write goes to a dedicated device that's guaranteed to be idle in relation to all that other stuff. Then the sync write becomes async, and gets coalesced into the pending TXG. If you don't have a dedicated log, then the sync write jumps the write queue, and becomes next in line. It waits for the present read or write operation to complete, and then the sync write hits the disk, and flushes the disk buffer. This means the sync write suffered a penalty waiting for the main pool disks to be interruptible. Without slog, you're causing delay to your sync writes, and you're causing delay before the next read or write operation can begin... But that's it. Without slog, your operations are serial, whereas, with slog your sync write can occur in parallel to your other operations. There's no extra fragmentation, with or without slog. Because in either case, the sync write hits some dedicated and recyclable disk blocks, and then it becomes async and coalesced with all the other async writes. The layout and/or fragmentation characteristics of the permanent TXG to be written to the pool is exactly the same either way. Thanks... but doesn't your description imply that the sync writes would always be written twice? It should be with dedicated SLOG, but even with one, I think, small writes hit the SLOG and large ones go straight to the pool devices (and smaller blocks catch up from the TXG queue upon TXG flush). However, without a dedicated SLOG, I thought that the writes into the ZIL happen once on the main pool devices, and then are referenced from the live block pointer tree without being rewritten elsewhere (and for the next TXG some other location may be used for the ZIL). Maybe I am wrong, because it would also make sense for small writes to hit the disk twice indeed, and the same pool location(s) being reused for the ZIL. You are both right and wrong, at the same time. It depends on the data. Without a slog, writes that are larger than zfs_immediate_write_sz are written to the permanent place in the pool. Please review (again) my slides on the subject. http://www.slideshare.net/relling/zfs-tutorial-lisa-2011 slide 78. For those who prefer to be lectured, another opportunity will arise in December 2012 in San Diego at the LISA'12 conference. I am revamping much of the material from 2011, to catch up with all of the cool new things that arrived and are due this year.
-- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
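A related per-dataset knob for readers of slide 78: logbias steers whether sync writes take the slog double-write path at all (dataset name illustrative):

zfs set logbias=latency tank/db     # default: use the slog, lowest latency
zfs set logbias=throughput tank/db  # skip the slog, write once to the pool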
Re: [zfs-discuss] IO load questions
On Jul 25, 2012, at 7:34 AM, Matt Breitbach wrote: NFS – iSCSI and FC/FCoE to come once I get it into the proper lab. ok, so NFS for these tests. I'm not convinced a single ESXi box can drive the load to saturate 10GbE. Also, depending on how you are configuring the system, the I/O that you think is 4KB might look very different coming out of ESXi. Use nfssvrtop or one of the many dtrace one-liners for observing NFS traffic to see what is really on the wire. And I'm very interested to know if you see 16KB reads during the write-only workload. more below... From: Richard Elling [mailto:richard.ell...@gmail.com] Sent: Tuesday, July 24, 2012 11:36 PM To: matth...@flash.shanje.com Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] IO load questions Important question, what is the interconnect? iSCSI? FC? NFS? -- richard On Jul 24, 2012, at 9:44 AM, matth...@flash.shanje.com wrote: Working on a POC for high IO workloads, and I’m running in to a bottleneck that I’m not sure I can solve. Testbed looks like this : SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPU’s, 72GB RAM, and ESXi VM – 4GB RAM, 1vCPU Connectivity dual 10Gbit Ethernet to Cisco Nexus 5010 Target Nexenta system : Intel barebones, Dual Xeon 5620 CPU’s, 192GB RAM, Nexenta 3.1.3 Enterprise Intel x520 dual port 10Gbit Ethernet – LACP Active VPC to Nexus 5010 switches. 2x LSI 9201-16E HBA’s, 1x LSI 9200-8e HBA 5 DAE’s (3 in use for this test) 1 DAE – connected (multipathed) to LSI 9200-8e. Loaded w/ 6x Stec ZeusRAM SSD’s – striped for ZIL, and 6x OCZ Talos C 230GB drives for L2ARC. 2 DAE’s connected (multipathed) to one LSI 9201-16E – 24x 600GB 15k Seagate Cheetah drives Obviously data integrity is not guaranteed Testing using IOMeter from windows guest, 10GB test file, queue depth of 64 I have a share set up with 4k recordsizes, compression disabled, access time disabled, and am seeing performance as follows : ~50,000 IOPS 4k random read. 200MB/sec, 30% CPU utilization on Nexenta, ~90% utilization on guest OS. I’m guessing guest OS is bottlenecking. Going to try physical hardware next week ~25,000 IOPS 4k random write. 100MB/sec, ~70% CPU utilization on Nexenta, ~45% CPU utilization on guest OS. Feels like Nexenta CPU is bottleneck. Load average of 2.5 For cases where you are not bandwidth limited, larger recordsizes can be more efficient. There is no good rule-of-thumb for this, and larger recordsizes will, at some point, hit the bandwidth bottlenecks. I've had good luck with 8KB and 32KB recordsize for ESXi+Windows over NFS. I've never bothered to test 16KB, due to lack of time. A quick test with 128k recordsizes and 128k IO looked to be 400MB/sec performance, can’t remember CPU utilization on either side. Will retest and report those numbers. It would not surprise me to see a CPU bottleneck on the ESXi side at these levels. -- richard It feels like something is adding more overhead here than I would expect on the 4k recordsizes/IO workloads. Any thoughts where I should start on this? I’d really like to see closer to 10Gbit performance here, but it seems like the hardware isn’t able to cope with it? Theoretical peak performance for a single 10GbE wire is near 300k IOPS @ 4KB, unidirectional. This workload is extraordinarily difficult to achieve with a single client using any of the popular storage protocols. 
-- richard -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 -- ZFS Performance and Training richard.ell...@richardelling.com +1-760-896-4422 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
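One such dtrace one-liner, assuming the illumos nfsv3 provider, to see the size distribution of what is actually on the wire:

dtrace -n 'nfsv3:::op-read-start { @r["read"] = quantize(args[2]->count); }
nfsv3:::op-write-start { @w["write"] = quantize(args[2]->count); }'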
Re: [zfs-discuss] slow speed problem with a new SAS shelf
On Jul 22, 2012, at 10:18 PM, Yuri Vorobyev wrote: Hello. I have run into a strange performance problem with a new disk shelf. We have been using a ZFS system with SATA disks for a while.

What OS and release? -- richard

It is a Supermicro SC846-E16 chassis, Supermicro X8DTH-6F motherboard with 96GB RAM and 24 HITACHI HDS723020BLA642 SATA disks attached to the onboard LSI 2008 controller. Pretty much satisfied with it, we bought an additional shelf with SAS disks for VM hosting. The new shelf is a Supermicro SC846-E26 chassis. The disk model is HITACHI HUS156060VLS600 (15K 600GB SAS2). An additional LSI 9205-8e controller was installed in the server and connected to the JBOD. I connected the JBOD with 2 channels and set up multipath first, but when I noticed the performance problem I disabled multipath and disconnected one cable (to be sure multipath was not the cause of the problem). Problem description follows:

Creating a test pool with 5 pairs of mirrors (new shelf, SAS disks), with primarycache=none to rule out ARC influence:

# zpool create -o version=28 -O primarycache=none test mirror c9t5000CCA02A138899d0 c9t5000CCA02A102181d0 mirror c9t5000CCA02A13500Dd0 c9t5000CCA02A13316Dd0 mirror c9t5000CCA02A005699d0 c9t5000CCA02A004271d0 mirror c9t5000CCA02A004229d0 c9t5000CCA02A1342CDd0 mirror c9t5000CCA02A1251E5d0 c9t5000CCA02A1151DDd0

Testing sequential write:

# dd if=/dev/zero of=/test/zero bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 1.04272 s, 2.1 GB/s

iostat when writing looks like:

  r/s     w/s  kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  0.0  1334.6   0.0  165782.9   0.0   8.4     0.0     6.3   1  86  c9t5000CCA02A1151DDd0
  0.0  1345.5   0.0  169575.3   0.0   8.7     0.0     6.5   1  88  c9t5000CCA02A1342CDd0
  2.0  1359.5   1.0  168969.8   0.0   8.7     0.0     6.4   1  90  c9t5000CCA02A13500Dd0
  0.0  1358.5   0.0  168714.0   0.0   8.7     0.0     6.4   1  90  c9t5000CCA02A13316Dd0
  0.0  1345.5   0.0      19.3   0.0   9.0     0.0     6.7   1  92  c9t5000CCA02A102181d0
  1.0  1317.5   1.0  164456.9   0.0   8.5     0.0     6.5   1  88  c9t5000CCA02A004271d0
  4.0  1342.5   2.0  166282.2   0.0   8.5     0.0     6.3   1  88  c9t5000CCA02A1251E5d0
  0.0  1377.5   0.0  170515.5   0.0   8.7     0.0     6.3   1  90  c9t5000CCA02A138899d0

Now read:

# dd if=/test/zero of=/dev/null bs=1M
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 13.5681 s, 158 MB/s

iostat when reading:

  r/s     w/s  kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
106.0     0.0  11417.4    0.0   0.0   0.2     0.0     2.4   0  14  c9t5000CCA02A004271d0
 80.0     0.0  10239.9    0.0   0.0   0.2     0.0     2.4   0  10  c9t5000CCA02A1251E5d0
110.0     0.0  12182.4    0.0   0.0   0.1     0.0     1.3   0   9  c9t5000CCA02A138899d0
102.0     0.0  11664.4    0.0   0.0   0.2     0.0     1.8   0  15  c9t5000CCA02A005699d0
 99.0     0.0  10900.9    0.0   0.0   0.3     0.0     3.0   0  16  c9t5000CCA02A004229d0
107.0     0.0  11545.4    0.0   0.0   0.2     0.0     1.9   0  13  c9t5000CCA02A1151DDd0
 81.0     0.0  10367.9    0.0   0.0   0.2     0.0     2.2   0  11  c9t5000CCA02A1342CDd0

Unexpectedly low speed! Note the busy column: when writing it is about 90%, when reading it is about 15%.

Individual disk raw read speed (don't be confused by the name change - I connected the JBOD to another HBA channel):

# dd if=/dev/dsk/c8t5000CCA02A13889Ad0 of=/dev/null bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 10.9685 s, 191 MB/s
# dd if=/dev/dsk/c8t5000CCA02A1342CEd0 of=/dev/null bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 10.8024 s, 194 MB/s

The 10-disk mirror zpool reads slower than a single disk. There is no tuning in /etc/system. I tried the test with a FreeBSD 8.3 live CD; reads were the same (about 150MB/s). I also tried SmartOS, but it can't see disks behind the LSI 9205-8e controller.
For comparison, this is the speed from the SATA pool (it consists of 4 6-disk raidz2 vdevs):

# dd if=CentOS-6.2-x86_64-bin-DVD1.iso of=/dev/null bs=1M
4218+1 records in
4218+1 records out
4423129088 bytes (4.4 GB) copied, 4.76552 s, 928 MB/s

    r/s   w/s  kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
13614.4   0.0  800338.5   0.0   0.1  36.0     0.0     2.6   0  914  c6
  459.9   0.0  25761.4    0.0   0.0   0.8     0.0     1.8   0   22  c6t5000CCA369D16860d0
   84.0   0.0   2785.2    0.0   0.0   0.2     0.0     3.0   0   13  c6t5000CCA369D1B1E0d0
  836.9   0.0  50089.5    0.0   0.0   2.6     0.0     3.1   0   60  c6t5000CCA369D1B302d0
  411.0   0.0  24492.6    0.0   0.0   0.8     0.0     2.1   0   25  c6t5000CCA369D16982d0
  821.9   0.0  49385.1    0.0   0.0   3.0     0.0     3.7   0   67  c6t5000CCA369CFBDA3d0
  231.0   0.0  12292.5    0.0   0.0   0.5     0.0     2.3   0   18  c6t5000CCA369D17E73d0
  803.9   0.0  50091.5    0.0   0.0   2.9     0.0     3.6   1   69  c6t5000CCA369D0EA93d0

P.S. Before testing I flashed the latest firmware and BIOS.
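One way to separate a pool-level problem from an expander/HBA bottleneck is to read several raw disks in parallel and see whether per-disk throughput holds up. A sketch, using two of the device names from above (extend the list to more of the shelf's disks as needed):

for d in c8t5000CCA02A13889Ad0 c8t5000CCA02A1342CEd0; do
  dd if=/dev/rdsk/${d}p0 of=/dev/null bs=1M count=2000 &   # raw reads, bypassing ZFS
done
wait

If each disk still delivers ~190 MB/s concurrently, the SAS path is fine and the slow reads are a pool scheduling issue; if throughput collapses, suspect the E26 expander, cabling, or the 9205-8e.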
Re: [zfs-discuss] zfs sata mirror slower than single disk
On Jul 16, 2012, at 2:43 AM, Michael Hase wrote: Hello list, I did some bonnie++ benchmarks for different zpool configurations consisting of one or two 1TB SATA disks (Hitachi HDS721010CLA332, 512 bytes/sector, 7.2k), and got some strange results. Please see the attachments for exact numbers and pool config:

         seq write  factor   seq read  factor
         MB/sec              MB/sec
single      123       1        135       1
raid0       114       1        249       2
mirror       57       0.5      129       1

Each of the disks is capable of about 135 MB/sec sequential reads and about 120 MB/sec sequential writes; iostat -En shows no defects. Disks are 100% busy in all tests, and show normal service times.

For 7,200 rpm disks, average service times should be on the order of 10ms for writes and 13ms for reads. If you see averages above 20ms, then you are likely running into scheduling issues. -- richard

This is on opensolaris 130b; rebooting with an openindiana 151a live CD gives the same results, and dd tests give the same results too. The storage controller is an LSI 1068 using the mpt driver. The pools are newly created and empty. atime on/off doesn't make a difference. Is there an explanation why 1) in the raid0 case the write speed is more or less the same as a single disk, and 2) in the mirror case the write speed is cut by half, and the read speed is the same as a single disk? I'd expect about twice the performance for both reading and writing, maybe a bit less, but definitely more than measured. For comparison I did the same tests with 2 old 2.5" 36GB SAS 10k disks maxing out at about 50-60 MB/sec on the outer tracks:

         seq write  factor   seq read  factor
         MB/sec              MB/sec
single       38       1         50       1
raid0        89       2        111       2
mirror       36       1         92       2

Here we get the expected behaviour: raid0 with about double the performance for reading and writing; mirror with about the same performance for writing, and double the speed for reading, compared to a single disk. An old SCSI system with 4x2 mirror pairs also shows these scaling characteristics: about 450-500 MB/sec seq read and 250 MB/sec write, each disk capable of 80 MB/sec. I don't care about absolute numbers, I just don't get why the SATA system is so much slower than expected, especially for a simple mirror. Any ideas? Thanks, Michael -- Michael Hase http://edition-software.de
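The service-time numbers quoted above are easy to watch while bonnie++ runs; nothing in this sketch is specific to the setup:

# iostat -xnz 5

Compare the asvc_t column for the pool disks against the ~10ms (writes) and ~13ms (reads) expected for 7,200 rpm drives; sustained averages above 20ms during the sequential tests point at queueing/scheduling rather than the disks themselves.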
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Thanks Sašo! Comments below... On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote: Hi guys, I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS implementation to supplant the currently utilized sha256.

No need to supplant; there are 8 bits for enumerating hash algorithms, so adding another is simply a matter of coding. With the new feature flags, it is almost trivial to add new algorithms without causing major compatibility headaches. Darren points out that Oracle is considering doing the same, though I do not expect Oracle to pick up the feature flags.

On modern 64-bit CPUs SHA-256 is actually much slower than SHA-512 and indeed much slower than many of the SHA-3 candidates, so I went out and did some testing (details attached) on a possible new hash algorithm that might improve on this situation. However, before I start out on a pointless endeavor, I wanted to probe the field of ZFS users, especially those using dedup, on whether their workloads would benefit from a faster hash algorithm (and hence, lower CPU utilization). Developments of late have suggested to me three possible candidates:

* SHA-512: simplest to implement (since the code is already in the kernel) and provides a modest performance boost of around 60%.
* Skein-512: overall fastest of the SHA-3 finalists and much faster than SHA-512 (around 120-150% faster than the current sha256).
* Edon-R-512: probably the fastest general-purpose hash algorithm I've ever seen (upward of 300% speedup over sha256), but might have potential security problems (though I don't think this is of any relevance to ZFS, as it doesn't use the hash for any kind of security purposes, but only for data integrity and dedup).

My testing procedure: nothing sophisticated. I took the implementation of sha256 from the Illumos kernel and simply ran it on a dedicated psrset (where possible with a whole CPU dedicated, even if only to a single thread) - I tested both the generic C implementation and the Intel assembly implementation. The Skein and Edon-R implementations are in C, optimized for 64-bit architectures, from the respective authors (the most up-to-date versions I could find). All code has been compiled using GCC 3.4.3 from the repos (the same that can be used for building Illumos). Sadly, I don't have access to Sun Studio.

The last Studio release suitable for building OpenSolaris is available in the repo. See the instructions at http://wiki.illumos.org/display/illumos/How+To+Build+illumos

I'd be curious about whether you see much difference based on Studio 12.1, gcc 3.4.3 and gcc 4.4 (or even 4.7). -- richard
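A quick userland proxy for the SHA-256 vs SHA-512 comparison above (this exercises OpenSSL's implementations, not the kernel code that matters for ZFS, so treat it only as a sanity check):

$ openssl speed sha256 sha512

On most 64-bit x86 CPUs this shows SHA-512 ahead at large block sizes, for the reason Sašo gives: the 64-bit word size lets it process more data per compression-function round.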
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Jul 11, 2012, at 10:11 AM, Bob Friesenhahn wrote: On Wed, 11 Jul 2012, Richard Elling wrote: The last Studio release suitable for building OpenSolaris is available in the repo. See the instructions at http://wiki.illumos.org/display/illumos/How+To+Build+illumos

Not correct as far as I can tell. You should re-read the page you referenced. Oracle rescinded (or lost) the special Studio releases needed to build the OpenSolaris kernel. The only way I can see to obtain these releases is illegally.

In the US, the term "illegal" is most often used for criminal law. Contracts between parties are covered under civil law. It is the responsibility of the parties to agree to and enforce civil contracts. This includes you, dear reader. -- richard
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Jul 11, 2012, at 10:23 AM, Sašo Kiselkov wrote: Hi Richard, On 07/11/2012 06:58 PM, Richard Elling wrote: Thanks Sašo! Comments below... On Jul 10, 2012, at 4:56 PM, Sašo Kiselkov wrote: Hi guys, I'm contemplating implementing a new fast hash algorithm in Illumos' ZFS implementation to supplant the currently utilized sha256. No need to supplant, there are 8 bits for enumerating hash algorithms, so adding another is simply a matter of coding. With the new feature flags, it is almost trivial to add new algorithms without causing major compatibility headaches. Darren points out that Oracle is considering doing the same, though I do not expect Oracle to pick up the feature flags.

I meant it in the functional sense, not the technical one - of course, my changes would be implemented as a feature-flags add-on.

Great! Let's do it! -- richard
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Jul 11, 2012, at 1:06 PM, Bill Sommerfeld wrote: On a somewhat less serious note, perhaps zfs dedup should contain "Chinese lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like...

Agree. George was in that section of the code a few months ago (zio.c) and I asked him to add a kstat, at least. I'll follow up with him next week, or get it done some other way. -- richard
Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones
To amplify what Mike says...

On Jul 10, 2012, at 5:54 AM, Mike Gerdts wrote: ls(1) tells you how much data is in the file - that is, how many bytes of data an application will see if it reads the whole file. du(1) tells you how many disk blocks are used. If you look at the stat structure in stat(2), ls reports st_size, du reports st_blocks.

Blocks full of zeros are special to ZFS compression - it recognizes them and stores no data. Thus, a file that contains only zeros will only require enough space to hold the file metadata.

$ zfs list -o compression ./
COMPRESS
on
$ dd if=/dev/zero of=1gig count=1024 bs=1024k
1024+0 records in
1024+0 records out
$ ls -l 1gig
-rw-r--r-- 1 mgerdts staff 1073741824 Jul 10 07:52 1gig

ls -ls shows the length (as in -l) and the size (as in -s, units=blocks), so you can see that it takes only space for metadata:

$ ls -ls 1gig
1 -rw-r--r-- 1 root root 1073741824 Nov 26 06:52 1gig

(the leading 1 is the size in blocks; 1073741824 is the length) -- richard
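The whole demonstration end to end, as a sketch (the tank/ztest dataset name is a placeholder):

# zfs create -o compression=on tank/ztest
# dd if=/dev/zero of=/tank/ztest/1gig bs=1024k count=1024
# ls -l /tank/ztest/1gig    # st_size: 1073741824 bytes
# du -k /tank/ztest/1gig    # st_blocks: a handful of KB, metadata only

With compression enabled, ZFS recognizes all-zero blocks and allocates nothing for them, which is why du and ls disagree so dramatically.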
Re: [zfs-discuss] Scenario sanity check
First things first, the panic is a bug. Please file one with your OS supplier. More below...

On Jul 6, 2012, at 4:55 PM, Ian Collins wrote: On 07/ 7/12 11:29 AM, Brian Wilson wrote: On 07/ 6/12 04:17 PM, Ian Collins wrote: On 07/ 7/12 08:34 AM, Brian Wilson wrote: Hello, I'd like a sanity check from people more knowledgeable than myself. I'm managing backups on a production system. Previously I was using another volume manager and filesystem on Solaris, and I've just switched to using ZFS. My model is:

Production Server A
Test Server B
Mirrored storage arrays (HDS TrueCopy, if it matters)
Backup software (TSM)

Production server A sees the live volumes. Test Server B sees the TrueCopy mirrors of the live volumes (it sees the secondary storage array; the production server sees the primary array). Production server A shuts down zone C and exports the zpools for zone C. Production server A splits the mirror to the secondary storage array, leaving the mirror writable. Production server A re-imports the pools for zone C and boots zone C. Test Server B imports the ZFS pool using -R /backup. Backup software backs up the mounted mirror volumes on Test Server B. Later in the day, after the backups finish, a script exports the ZFS pools on Test Server B and re-establishes the TrueCopy mirror between the storage arrays.

That looks awfully complicated. Why don't you just clone a snapshot and back up the clone?

Taking a snapshot and cloning incurs IO. Backing up the clone incurs a lot more IO reading off the disks and going over the network. These aren't acceptable costs in my situation.

Yet it is acceptable to shut down the zones and export the pools? I'm interested to understand how a service outage is preferred over I/O?

So splitting a mirror and reconnecting it doesn't incur I/O? It does. The solution is complicated if you're starting from scratch. I'm working in an environment that already had all the pieces in place (offsite synchronous mirroring, a test server to mount stuff up on, scripts that automated the storage array mirror management, etc). It was set up that way specifically to accomplish short-downtime outages for cold backups with minimal or no IO hit to production. So while it's complicated, when it was put together it was also the most obvious thing to do to drop my backup window to almost nothing and keep all the IO from the backup from impacting production. And like I said, with a different volume manager, it's been rock solid for years.

... where data corruption is blissfully ignored? I'm not sure what volume manager you were using, but SVM has absolutely zero data integrity checking :-( And no, we do not miss using SVM :-)

So, to ask the sanity check more specifically - is it reasonable to expect ZFS pools to be exported, have their LUNs change underneath, then later import the same pool on those changed drives again?

Yes, we do this quite frequently. And it is tested ad nauseam. Methinks it is simply a bug, perhaps one that is already fixed.

If you were splitting ZFS mirrors to read data from one half, all would be sweet (and you wouldn't have to export the pool). I guess the question here is: what does TrueCopy do under the hood when you re-connect the mirror?

Yes, this is one of the use cases for zpool split. However, zpool split creates a new pool, which is not what Brian wants, because reattaching the disks requires a full resilver. Using TrueCopy as he does is a reasonable approach for Brian's use case.
-- richard
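For the native-ZFS variant of this workflow mentioned above, zpool split is the relevant tool. A sketch with placeholder pool names (and the caveat noted above: reattaching the split disks later requires a full resilver):

# zpool split tank tankcopy           # detach one side of each mirror into a new pool
# zpool import -R /backup tankcopy    # import under an altroot on the backup host
... run the backup ...
# zpool export tankcopy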
Re: [zfs-discuss] Very sick iSCSI pool
Hi Ian, Chapter 7 of the DTrace book has some examples of how to look at iSCSI target and initiator behaviour. -- richard

On Jun 28, 2012, at 10:47 PM, Ian Collins wrote: I'm trying to work out the cause of, and a remedy for, a very sick iSCSI pool on a Solaris 11 host. The volume is exported from an Oracle storage appliance and there are no errors reported there. The host has no entries in its logs relating to the network connections. Any zfs or zpool commands that change the state of the pool (such as zfs mount or zpool export) hang and can't be killed. fmadm faulty reports:

Jun 27 14:04:24 536fb2ad-1fca-c8b2-fc7d-f5a4a94c165d ZFS-8000-FD Major
Host: taitaklsc01
Platform: SUN-FIRE-X4170-M2-SERVER  Chassis_id: 1142FMM02N  Product_sn: 1142FMM02N
Fault class: fault.fs.zfs.vdev.io
Affects: zfs://pool=fileserver/vdev=68c1bdefa6f97db8 faulted but still in service
Problem in: zfs://pool=fileserver/vdev=68c1bdefa6f97db8 faulted but still in service
Description: The number of I/O errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD for more information.

The zpool status paints a very gloomy picture:

  pool: fileserver
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Jun 29 11:59:59 2012
        858K scanned out of 15.7T at 43/s, (scan is slow, no estimated time)
        567K resilvered, 0.00% done
config:
  NAME                                 STATE   READ  WRITE  CKSUM
  fileserver                           ONLINE     0  1.16M      0
    c0t600144F096C94AC74ECD96F20001d0  ONLINE     0  1.16M      0  (resilvering)
errors: 1557164 data errors, use '-v' for a list

Any ideas how to determine the cause of the problem and remedy it? -- Ian.
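Alongside the DTrace book material, the FMA telemetry usually narrows down whether this is transport or device trouble. A sketch of the usual first looks (nothing here is specific to this host):

# fmdump -eV | tail -50    # raw ereports behind the fmadm faulty entry
# iostat -xnze 5           # per-LUN latency plus error counters while the pool is wedged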
Re: [zfs-discuss] oddity of slow zfs destroy
On Jun 25, 2012, at 10:55 AM, Philip Brown wrote: I ran into something odd today: zfs destroy -r random/filesystem is mindbogglingly slow. But it seems to me it shouldn't be. It's slow because the filesystem has two snapshots on it. Presumably, it's busy rolling back the snapshots. But I've already declared on my command line that I DON'T CARE about the contents of the filesystem! Why doesn't zfs simply do:

1. unmount the filesystem, if possible (it was possible)
(1.5 possibly note the intent to delete somewhere in the pool records)
2. zero out/free the in-kernel memory in one go
3. update the pool: hey, I deleted the filesystem, all these blocks are now clear

Having this kind of operation take more than even 10 seconds seems like a huge bug to me, yet it can take many minutes. An order of magnitude off. Yuck.

Agree. Asynchronous destroy has been integrated into illumos. Look for it soon in the distributions derived from illumos. For more information, see Chris Siden and Matt Ahrens' discussions on async destroy and ZFS feature flags at the ZFS Meetup in January 2012 here: http://blog.delphix.com/ahl/2012/zfs10-illumos-meetup/ -- richard
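Until async destroy reaches your distribution, it is easy to see up front how much work a recursive destroy will have to unwind. A sketch (the dataset name is from Philip's example):

# zfs list -t snapshot -r random/filesystem        # every snapshot the -r destroy must also delete
# zfs get used,usedbysnapshots random/filesystem   # how much data is tied up in snapshots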
Re: [zfs-discuss] Recommendation for home NAS external JBOD
On Jun 20, 2012, at 4:08 AM, Jim Klimov wrote: Also by default, if you don't give the whole drive to ZFS, its cache may be disabled upon pool import and you may have to re-enable it manually (if you only actively use this disk for one or more ZFS pools - which play with caching nicely).

This is not correct. The behaviour is to attempt to enable the disk's write cache if ZFS has the whole disk. Relevant code: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#319

Please help us to stop propagating the misinformation that ZFS disables write caches. -- richard
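To see what a given disk's write cache is actually set to, format(1m) in expert mode exposes it. An interactive sketch (menu entries can vary by drive and transport type):

# format -e
... select the disk ...
format> cache
cache> write_cache
write_cache> display
Write Cache is enabled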
Re: [zfs-discuss] Recommendation for home NAS external JBOD
On Jun 20, 2012, at 5:08 PM, Jim Klimov wrote: 2012-06-21 1:58, Richard Elling wrote: On Jun 20, 2012, at 4:08 AM, Jim Klimov wrote: Also by default if you don't give the whole drive to ZFS, its cache may be disabled upon pool import and you may have to reenable it. The behaviour is to attempt to enable the disk's write cache if ZFS has the whole disk. Relevant code: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#319 Please help us to stop propagating the misinformation that ZFS disables write caches. -- richard

I see, sorry. So, the possible states are:

1) Before pool import, the disk cache was disabled; then the pool is imported:
1a) If ZFS has the whole disk (how is that defined, BTW, since partitions and slices are really used? Is the presence of a slice 7 that is 16384 sectors long the trigger?) - then the cache is enabled;

By the command used:
zpool create c0t0d0   == whole disk
zpool create c0t0d0s0 == not whole disk

1b) If ZFS does not have the whole disk - the cache is neither enabled nor disabled;
2) Before import the disk cache was enabled; after import: no change regardless of whole-diskness.

Is this correct?

Correct.

How does a disk become cache-disabled then - only manually? Or due to UFS usage? Or does it inherit the HW setting? Or somehow else?

For Sun, it was done by setting the disk firmware. I think the cache is enabled in the OS by default… In general, illumos does not touch the cache. I don't know of a way to set the cache policy in most BIOSes. In some cases, you can set it using format(1m), but whether it remains set after power-off depends on the drive manufacturer. Bottom line: don't worry about it. -- richard
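Whether ZFS considers a vdev whole-disk is recorded in the vdev label, so it can be checked directly rather than inferred. A sketch (the pool name is a placeholder):

# zdb -C tank | grep whole_disk
        whole_disk: 1

A value of 1 means ZFS owns the whole disk and will attempt to enable the write cache on import, per the vdev_disk.c code cited above.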
Re: [zfs-discuss] Migrating 512 byte block zfs root pool to 4k disks
On Jun 15, 2012, at 7:37 AM, Hung-Sheng Tsao Ph.D. wrote: By the way, when you format, start with cylinder 1; do not use 0.

There is no requirement for skipping cylinder 0 for root on Solaris, and there never has been. -- richard
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
[Phil beat me to it] Yes, the 0s are a result of integer division in DTrace/kernel.

On Jun 14, 2012, at 9:20 PM, Timothy Coalson wrote: Indeed they are there, shown with a 1-second interval. So, it is the client's fault after all. I'll have to see whether it is somehow possible to get the server to write cached data sooner (and hopefully asynchronously), and the client to issue commits less often. Luckily I can live with the current behavior (and the SSDs shouldn't give out any time soon, even being used like this), if it isn't possible to change it.

If this is the proposed workload, then it is possible to tune the DMU to manage commits more efficiently. In an ideal world, it does this automatically, but the algorithms are based on a bandwidth calculation and those are not suitable for HDD capacity planning. The efficiency goal would be to do less work, more often, and there are two tunables that can apply:

1. txg_timeout controls the default maximum transaction group commit interval and is set to 5 seconds on modern ZFS implementations.
2. zfs_write_limit is a size limit for txg commit. The idea is that a txg will be committed when the size reaches this limit, rather than waiting for the txg_timeout. For streaming writes, this can work better than tuning the txg_timeout.

-- richard

Thanks for all the help, Tim

On Thu, Jun 14, 2012 at 10:30 PM, Phil Harman phil.har...@gmail.com wrote: On 14 Jun 2012, at 23:15, Timothy Coalson tsc...@mst.edu wrote: The client is using async writes, that include commits. Sync writes do not need commits. Are you saying nfs commit operations sent by the client aren't always reported by that script?

They are not reported in your case because the commit rate is less than one per second. DTrace is an amazing tool, but it does dictate certain coding compromises, particularly when it comes to output scaling, grouping, sorting and formatting. In this script the commit rate is calculated using integer division. In your case the sample interval is 5 seconds, so up to 4 commits per second will be reported as a big fat zero. If you use a sample interval of 1 second you should see occasional commits. We know they are there because we see a non-zero commit time.
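As a concrete sketch of those two knobs on an illumos-era system — the variable names are assumptions to verify against your kernel (the write limit appeared as zfs_write_limit_override in some releases), and any such tuning should be tested before production:

* /etc/system
set zfs:zfs_txg_timeout = 1
set zfs:zfs_write_limit_override = 0x8000000

The first commits txgs every second instead of every 5; the second caps a txg at 128MB, forcing a commit when the size limit is reached.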
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 14, 2012, at 1:35 PM, Robert Milkowski wrote: "The client is using async writes, that include commits. Sync writes do not need commits. What happens is that the ZFS transaction group commit occurs at more-or-less regular intervals, likely 5 seconds for more modern ZFS systems. When the commit occurs, any data that is in the ARC but not committed in a prior transaction group gets sent to the ZIL." Are you sure? I don't think this is the case, unless I misunderstood you or this is some recent change to Illumos.

We need to make sure we are clear here: there is time between the txg being closed and the txg being on disk. During that period, a sync write of the data in the closed txg is written to the ZIL.

Whatever is being committed when the zfs txg closes goes directly to the pool and not to the ZIL. Only sync writes go to the ZIL right away (and not always, see logbias, etc.) and to the ARC to be committed later to the pool when the txg closes.

In this specific case, there are separate log devices, so logbias doesn't apply. -- richard
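Since logbias came up: it is a per-dataset property and easy to inspect or change. A sketch with a placeholder dataset name:

# zfs get logbias,sync tank/fs
# zfs set logbias=throughput tank/fs   # steer large sync writes past the slog, into the main pool

With separate log devices and the default logbias=latency, sync writes land on the slog as described above; throughput mode trades commit latency for less slog traffic.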
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
On Jun 13, 2012, at 4:51 PM, Daniel Carosone wrote: On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote: client: ubuntu 11.10, /etc/fstab entry:

server:/mainpool/storage /mnt/myelin nfs bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async 0 0

nfsvers=3

NAME              PROPERTY  VALUE     SOURCE
mainpool/storage  sync      standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are for persistent writes at the server regardless - and mostly also for NFS 4.

NB, async NFS was introduced in NFSv3. To help you easily see NFSv3/v4 async and sync activity, try nfssvrtop: https://github.com/richardelling/tools -- richard
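Getting the tool running takes a minute; a sketch (install location is your choice):

$ git clone https://github.com/richardelling/tools
$ cd tools
# ./nfssvrtop 5    # 5-second intervals; -b <bytes> sets the block size used for the Align% column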
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
Hi Tim,

On Jun 14, 2012, at 12:20 PM, Timothy Coalson wrote: Thanks for the script. Here is some sample output from 'sudo ./nfssvrtop -b 512 5' (my disks are 512B-sector emulated and the pool is ashift=9; some benchmarking didn't show much difference with ashift=12 other than giving up 8% of available space) during a copy operation from xxx.xxx.37.30 with sync=standard:

2012 Jun 14 13:59:13, load: 0.68, read: 0 KB, swrite: 0 KB, awrite: 557056 KB

Ver Client        NFSOPS Reads SWrites AWrites Commits Rd_bw SWr_bw AWr_bw Rd_t SWr_t AWr_t   Com_t Align%
3   xxx.xxx.37.30    108     0       0     108       0     0      0 111206    0     0   396 1917419    100

a bit later...

3   xxx.xxx.37.30    109     0       0     108       0     0      0 111411    0     0   427       0    100

Sample output from the end of 'zpool iostat -v 5 mainpool' concurrently:

logs           -      -      -      -      -      -
  c31t3d0s0  260M  9.68G      0  1.21K      0  85.3M
  c31t4d0s0  260M  9.68G      0  1.21K      0  85.1M

In case the column alignment gets mangled, the nonzero entries are under NFSOPS, AWrites, AWr_bw, AWr_t, Com_t and Align%. The Com_t (average commit time?) column alternates between zero and a million or two (the other columns stay about the same, the zeros stay zero), while the Commits column stays zero during the copy. The write throughput to the logs varies quite a bit; that sample is a very high mark. It mainly alternates between almost zero and 30M each, which is kind of odd considering the copy speed (using gigabit network, copy speed averages around 110MB/s).

The client is using async writes, that include commits. Sync writes do not need commits. What happens is that the ZFS transaction group commit occurs at more-or-less regular intervals, likely 5 seconds for more modern ZFS systems. When the commit occurs, any data that is in the ARC but not committed in a prior transaction group gets sent to the ZIL. This is why you might see a very different amount of ZIL activity relative to the expected write workload.

When I 'zfs set sync=disabled', the output of nfssvrtop stays about the same, except Com_t stays 0, and the log devices also stay at 0 throughput. Could you enlighten me as to what Com_t measures when Commits stays zero? Perhaps the nfs server caches asynchronous nfs writes how I expect, but flushes its cache with synchronous writes?

With sync=disabled, the ZIL is not used, thus the commit response to the client is a lie, breaking the covenant between the server and client. In other words, the server is supposed to respond to the commit only when the data is written to permanent media, but the administrator overruled this action by disabling the ZIL. If the server were to unexpectedly restart, or other conditions occur such that the write cannot be completed, then the server and client will have different views of the data: a form of data loss.

Different applications can react to long commit times differently. In this example, we see 1.9 seconds for the commit versus about 400 microseconds for each async write. The cause of the commit latency is not apparent from any bandwidth measurements (e.g. zpool iostat), and you should consider looking more closely at the iostat -x latency to see if the log devices are performing well. -- richard
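The closing suggestion above is straightforward to act on. A sketch using the log slices from the zpool iostat output (-p adds per-partition rows):

# iostat -xnzp 1

Watch asvc_t for c31t3d0s0 and c31t4d0s0: if the slog devices show sub-millisecond service times while commits take ~1.9 seconds, the latency is accumulating above the disks (txg commit scheduling), not in them.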
Re: [zfs-discuss] Scrub works in parallel?
On Jun 11, 2012, at 6:05 AM, Jim Klimov wrote: 2012-06-11 5:37, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Kalle Anka: Assume we have 100 disks in one zpool. Assume it takes 5 hours to scrub one disk. If I scrub the zpool, how long will it take? Will it scrub one disk at a time, so it will take 500 hours, i.e. in sequence, serially? Or is it possible to run the scrub in parallel, so it takes 5 hours no matter how many disks?

It will be approximately parallel, because it's actually scrubbing only the used blocks, and the order it scrubs in will be approximately the order they were written, which was intentionally parallel.

What the other posters said, plus: 100 disks is quite a lot of contention on the bus(es), so even if it is all parallel, the bus and CPU bottlenecks would raise the scrubbing time somewhat above the single-disk scrub time.

In general, this is not true for HDDs or modern CPUs. Modern systems are overprovisioned for bandwidth. In fact, bandwidth has been a poor design point for storage for a long time. Dave Patterson has some interesting observations on this, now eight years old: http://www.ll.mit.edu/HPEC/agendas/proc04/invited/patterson_keynote.pdf

SSDs tend to be a different story, and there is some interesting work being done in this area, both on the systems side as well as the SSD side. This is where the fun work is progressing :-) -- richard
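The parallelism is easy to observe directly on any pool. A sketch (the pool name is a placeholder):

# zpool scrub tank
# iostat -xnz 5       # during the scrub, all data disks should show concurrent read activity
# zpool status tank   # progress, scan rate, and estimated completion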
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On Jun 6, 2012, at 12:48 AM, Sašo Kiselkov wrote: So I have this dual 16-core Opteron Dell R715 with 128G of RAM attached to a SuperMicro disk enclosure with 45 2TB Toshiba SAS drives (via two LSI 9200 controllers and MPxIO) running OpenIndiana 151a4, and I'm occasionally seeing a storm of xcalls on one of the 32 VCPUs (~100k xcalls a second).

That isn't much of a storm; I've seen 1M xcalls in some cases...

The machine is pretty much idle, only receiving a bunch of multicast video streams and dumping them to the drives (at a rate of ~40MB/s). At an interval of roughly 1-2 minutes I get a storm of xcalls that completely eats one of the CPUs, so the mpstat line for the CPU looks like:

CPU minf mjf   xcal intr ithr csw icsw migr smtx srw syscl usr sys  wt idl
 31    0   0 102191 1000    0   0    0    0    0   0     0   0 100   0   0

i.e. 100% busy in the system, processing cross-calls. When I tried dtracing this issue, I found that this is the most likely culprit:

dtrace -n 'sysinfo:::xcalls {@[stack()]=count();}'

              unix`xc_call+0x46
              unix`hat_tlb_inval+0x283
              unix`x86pte_inval+0xaa
              unix`hat_pte_unmap+0xed
              unix`hat_unload_callback+0x193
              unix`hat_unload+0x41
              unix`segkmem_free_vn+0x6f
              unix`segkmem_zio_free+0x27
              genunix`vmem_xfree+0x104
              genunix`vmem_free+0x29
              genunix`kmem_slab_destroy+0x87
              genunix`kmem_slab_free+0x2bb
              genunix`kmem_magazine_destroy+0x39a
              genunix`kmem_depot_ws_reap+0x66
              genunix`taskq_thread+0x285
              unix`thread_start+0x8
              3221701

This happens in the sched (pid 0) process. My fsstat output looks like this:

# fsstat /content 1
 new  name  name  attr  attr  lookup  rddir  read  read   write  write
file remov  chng   get   set     ops    ops   ops  bytes    ops  bytes
   0     0     0   664     0     952      0     0      0    664  38.0M /content
   0     0     0   658     0     935      0     0      0    656  38.6M /content
   0     0     0   660     0     946      0     0      0    659  37.8M /content
   0     0     0   677     0     969      0     0      0    676  38.5M /content

What's even more puzzling is that this happens apparently entirely because of some factor other than userland, since I see no changes to CPU usage of processes in prstat(1M) when this xcall storm happens, only an increase of loadavg of +1.00 (the busy CPU).

What exactly is the workload doing? Local I/O, iSCSI, NFS, or CIFS?

I Googled and found that http://mail.opensolaris.org/pipermail/dtrace-discuss/2009-September/008107.html seems to have been an issue identical to mine; however, it remained unresolved at that time, and it worries me about putting this kind of machine into production use. Could some ZFS guru please tell me what's going on in segkmem_zio_free?

It is freeing memory.

When I disable the writers to the /content filesystem, this issue goes away, so it obviously has something to do with disk IO. Thanks!

Not directly related to disk I/O bandwidth. It can be directly related to other use, such as deletions -- something that causes frees. Depending on the cause, there can be some tuning that applies for large-memory machines, where large is >= 96 GB. -- richard
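When chasing this kind of kernel-memory free storm, two quick looks help bound the problem; both are standard illumos tools, nothing workload-specific:

# echo ::memstat | mdb -k    # where RAM sits: kernel, ZFS file data, anon, free
# echo ::kmastat | mdb -k    # per-kmem-cache usage, to see which caches the reaper is draining

If the frees line up with the ARC shrinking and growing, the usual large-memory tuning is to cap the ARC (set zfs:zfs_arc_max in /etc/system) so the kmem reaper has less to tear down at once — an assumption about this particular workload, not a diagnosis.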