Re: [zfs-discuss] Changing GUID
On Nov 15, 2010, at 2:11 AM, sridhar surampudi wrote:

> Hi, I am looking along similar lines. My requirement is:
> 1. Create a zpool on one or many devices (LUNs) from an array (the array can be IBM or HP EVA or EMC etc. -- not SS7000).
> 2. Create file systems on the zpool.
> 3. Once the file systems are in use (I/O is happening), take a snapshot at the array level:
>    a. Freeze the ZFS file system (not required, due to ZFS consistency; source: mailing groups).
>    b. Take an array snapshot (say, IBM FlashCopy).
>    c. Get a new snapshot device (having the same data and metadata, including the same GUID as the source pool).
> Now I need a way to change the GUID and pool name of the snapshot device so that the snapshot device can be accessed on the same host or on an alternate host (if the LUN is shared).

Methinks you need to understand a little bit of the architecture. If you have an exact copy, then it is indistinguishable from the original. If ZFS (or insert your favorite application here) sees two identical views of data that are not, in fact, identical, then you break an assumption that the application makes. By changing the GUID you are forcing them to not be identical, which is counter to the whole point of hardware snapshots. Perhaps what you are trying to do and the method you have chosen are not compatible.

BTW, I don't understand why you make a distinction between other arrays and the SS7000 above. If I make a snapshot of a zvol, then it is identical from the client's perspective, and the same conditions apply.
 -- richard

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS + L2ARC + Cluster.
comment below...

On Nov 15, 2010, at 4:21 PM, Matt Banks wrote:

> On Nov 15, 2010, at 4:15 PM, Erik Trimble wrote:
>> On 11/15/2010 2:55 PM, Matt Banks wrote:
>>> I asked this on the x86 mailing list (and got an "it should work" answer), but this is probably the more appropriate place for it. In a 2-node Sun Cluster (3.2 running Solaris 10 u8, but could be running u9 if needed), we're looking at moving from VxFS to ZFS. However, quite frankly, part of the rationale is L2ARC. Would it be possible to use internal (SSD) storage for the L2ARC in such a scenario? My understanding is that if a ZFS filesystem is passed from one node to another, the L2ARC has to be rebuilt. So, why can't it just be rebuilt on internal storage? The nodes (X4240s) are identical and would have identical storage installed, so the paths would be the same. Has anyone done anything similar to this? I'd love something more than "it should work" before dropping $25k on SSDs... TIA, matt
>>
>> If your SSD is part of the shared storage (and, thus, visible from both nodes), then it will be part of the whole pool when exported/imported by the cluster failover software. If, on the other hand, you have an SSD in each node that is attached to the shared storage as L2ARC, then it's not visible to the other node, and the L2ARC would have to be reattached and rebuilt in a failover scenario. If you are using X4240 systems ONLY, then you don't have ANY shared storage -- ZFS isn't going to be able to fail over between the two nodes. You'd have to mirror the data between the two nodes somehow; they wouldn't be part of the same zpool. Really, what you want is something like a J4000-series array dual-attached to both X4240s, with the SSDs and HDs installed in the J4000-series chassis, not in the X4240s.
> Believe you me, had the standalone J4x00s not been EOL'd on 24-Sep-2010 (and if they supported SSDs), or if the 2540s/2501 we have attached to this cluster supported SSDs, that would be my first choice (honestly, I LOVE the J4x00s -- we get great performance out of them every time we've installed them, better at times than 2540s or 6180s). However, at this point, the only real choice we seem to have for external storage from Oracle is an F5100, or stepping up to a 6580 with a CSM2, or a 7120. The 6580 obviously ain't gonna happen, and a 7120 leaves us with NFS, and NFS+Solaris+InterSystems Caché has massive performance issues. The F5100 may be an option, but I'd like to explore this first. (In the interest of a complete description of this particular configuration: we have 2x 2540s -- one of which has a 2501 attached to it -- attached to 2x X4240s. The 2540s are entirely populated with 7200 rpm SATA drives. The external file systems are VxFS at this point, managed by Volume Manager, and have been in production for well over a year. When these systems were installed, ZFS still wasn't an option for us.) I'm OK having to rebuild the L2ARC cache in case of a failover.

The L2ARC is rebuilt any time the pool is imported. If the L2ARC devices are not found, then the pool is still OK, but will be listed as degraded (see the definition of "degraded" in the zpool man page). This is harmless from a data-protection viewpoint, though if you intend to run that way for a long time, you might just remove the L2ARC from the pool. In the case of clusters with the L2ARC unshared, we do support this under NexentaStor HA-Cluster and it is a fairly common case. I can't speak for what Oracle can support.
 -- richard

> They don't happen often. And it's not like this is entirely unprecedented. This is exactly the model Oracle uses for the 7000-series storage with cluster nodes.
> The Readzillas (or whatever they're called now) are in the cluster nodes -- meaning if one fails, the other takes over and has to rebuild its L2ARC. I'm talking about having an SSD (or more, but let's say one for simplicity's sake) in each of the X4240s. One is sitting unused in node B, waiting for node A to fail. Node A's SSD is in use as L2ARC. Then node A fails, the ZFS file systems fail over, and node B's SSD (located at the same path as it was in node A) is used as L2ARC for the failed-over file system. The $2,400 for two Marlin SSDs is a LOT less money than the $47k (incl. HBAs) the low-end F5100 would run (MSRP).
> matt
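The failover scenario being discussed can be sketched with ordinary zpool commands. This is a hedged illustration, not a supported procedure: the pool name (tank) and SSD device path (c1t1d0) are assumptions, and cluster import scripting is left out.

```shell
# On node A, while it owns the pool: use the local internal SSD as L2ARC.
zpool add tank cache c1t1d0

# After a failover, node B imports the pool. Node A's SSD is absent,
# so the pool imports with the cache vdev missing (pool reported as
# degraded, data intact).
zpool import tank

# Drop the stale cache vdev and add node B's local SSD; since the
# hardware is identical the path is the same. The new L2ARC starts
# cold and warms up over time.
zpool remove tank c1t1d0
zpool add tank cache c1t1d0
```

The design point here is that cache (L2ARC) vdevs are disposable: losing one never risks data, only read performance, which is why per-node unshared SSDs are workable in a cluster.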
Re: [zfs-discuss] Excruciatingly slow resilvering on X4540 (build 134)
Measure the I/O performance with iostat. You should see something that looks sorta like (iostat -zxCn 10):

                     extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
 5948.9  349.3 40322.3  5238.1   0.1  16.7     0.0     2.7   0  330  c9
    3.7    0.0   230.7     0.0   0.0   0.1     0.0    13.5   0    2  c9t1d0
  845.0    0.0  5497.4     0.0   0.0   0.9     0.0     1.1   1   32  c9t2d0
    3.8    0.0   230.7     0.0   0.0   0.0     0.0    10.6   0    1  c9t3d0
  845.2    0.0  5495.4     0.0   0.0   0.9     0.0     1.1   1   32  c9t4d0
    3.8    0.0   237.1     0.0   0.0   0.0     0.0    10.4   0    1  c9t5d0
  841.4    0.0  5519.7     0.0   0.0   0.9     0.0     1.1   1   32  c9t6d0
    3.8    0.0   237.3     0.0   0.0   0.0     0.0     9.2   0    1  c9t7d0
  843.5    0.0  5485.2     0.0   0.0   0.9     0.0     1.1   1   31  c9t8d0
    3.7    0.0   230.8     0.0   0.0   0.1     0.0    15.2   0    2  c9t9d0
  850.2    0.0  5488.6     0.0   0.0   0.9     0.0     1.1   1   31  c9t10d0
    3.1    0.0   211.2     0.0   0.0   0.0     0.0    13.2   0    1  c9t11d0
  847.9    0.0  5523.4     0.0   0.0   0.9     0.0     1.1   1   31  c9t12d0
    3.1    0.0   204.9     0.0   0.0   0.0     0.0     9.6   0    1  c9t13d0
  847.2    0.0  5506.0     0.0   0.0   0.9     0.0     1.1   1   31  c9t14d0
    3.4    0.0   224.1     0.0   0.0   0.0     0.0    12.3   0    1  c9t15d0
    0.0  349.3     0.0  5238.1   0.0   9.9     0.0    28.4   1  100  c9t16d0

Here you can clearly see a raidz2 resilver in progress. c9t16d0 is the disk being resilvered (write workload) and half of the others are being read to generate the resilvering data. Note the relative performance and the ~30% busy for the surviving disks.

If you see iostat output that looks significantly different than this, then you might be seeing one of two common causes:
1. Your version of ZFS has the new resilver throttle *and* the pool is otherwise servicing I/O.
2. Disks are throwing errors or responding very slowly. Use fmdump -eV to observe error reports.
 -- richard

On Nov 1, 2010, at 12:33 PM, Mark Sandrock wrote:

> Hello, I'm working with someone who replaced a failed 1TB drive (50% utilized) on an X4540 running OS build 134, and I think something must be wrong. Last Tuesday afternoon, zpool status reported:
>   scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go
> and, a week being 168 hours, that put completion at sometime tomorrow night.
> However, he just reported zpool status shows:
>   scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go
> so it's looking more like 2011 now. That can't be right. I'm hoping for a suggestion or two on this issue. I'd search the archives, but they don't seem searchable. Or am I wrong about that? Thanks.
> Mark (subscription pending)

--
ZFS and performance consulting
http://www.RichardElling.com
Re: [zfs-discuss] New system, Help needed!
On Nov 15, 2010, at 8:48 AM, Frank wrote:

> I am a newbie on Solaris. We recently purchased a Sun SPARC M3000 server. It comes with 2 identical hard drives. I want to set up a RAID 1. After searching on Google, I found that hardware RAID does not work with the M3000, so I am here to look for help on how to set up ZFS to use RAID 1. Currently one hard drive is installed with Solaris 10 10/09. I want to set up ZFS RAID 1 without reinstalling Solaris. Is that possible, and how can I do that?

The process is documented in the ZFS Administration Guide. http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfsadmin.pdf
 -- richard
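A hedged sketch of the documented procedure, assuming the existing Solaris 10 install already boots from a ZFS root pool named rpool (if it was installed on UFS, a Live Upgrade migration is needed first). The device names are illustrative; the second disk must carry an SMI label with an s0 slice matching the first:

```shell
# Attach the second disk to the root pool, turning it into a two-way mirror.
zpool attach -f rpool c0t0d0s0 c0t1d0s0

# Watch resilvering; the mirror is not protective until it completes.
zpool status rpool

# Install boot blocks on the new disk so the system can boot from it (SPARC).
installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk \
    /dev/rdsk/c0t1d0s0
```

The design point: unlike hardware RAID 1, the mirror is created live with no reinstall, and either disk remains independently bootable once the boot blocks are in place.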
Re: [zfs-discuss] Changing GUID
Actually, I did this very thing a couple of years ago with M9000s and EMC DMX4s ... with the exception of the same-host requirement you have (i.e. the thing that requires the GUID change). If you want to import the pool back into the host where the cloned pool is also imported, it's not just the zpool's GUID that needs to be changed, but those of all the vdevs in the pool too. When I did some work on OpenSolaris in Amazon S3, I noticed that someone had built a zpool mirror-split utility (before we had the real thing) as a means to clone boot disk images. IIRC it was just a hack of zdb, but with the ZFS source out there it's not impossible to take a zpool and change all its GUIDs; it's just not trivial (the Amazon case only handled a single simple mirrored vdev).

Anyway, back to my EMC scenario... The dear data centre staff I had to work with mandated the use of good old EMC BCVs. I pointed out that ZFS's always-consistent-on-disk promise meant that it would just work, but that this required a consistent snapshot of all the LUNs in the pool (a feature, in addition to basic BCVs, that EMC charged even more for). Hoping to save money, my customer ignored my advice, and very quickly learned the error of their ways! The always-consistent-on-disk promise cannot be honoured if the vdevs are snapshotted at different times. On a quiet system you may get lucky in simple tests, only to find that a snapshot from a busy production system causes a system panic on import (although the more recent automatic uberblock recovery may save you).

The other thing I would add to your procedure is to take a ZFS snapshot just before taking the storage-level snapshot. You could sync this with quiescing applications, but the real benefit is that you have a known point in time at which all non-sync application-level writes are temporally consistent.
Phil
http://harmanholistix.com

On 15 Nov 2010, at 10:11, sridhar surampudi toyours_srid...@yahoo.co.in wrote:

> Hi, I am looking along similar lines. My requirement is ... [snip -- same requirements as quoted earlier in this thread] ... Now I need a way to change the GUID and pool name of the snapshot device so that the snapshot device can be accessed on the same host or on an alternate host (if the LUN is shared). Could you please post commands for the same.
> Regards, sridhar.
> -- This message posted from opensolaris.org
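For reference, a hedged note: ZFS implementations released after this thread grew a first-class command for rewriting a pool's GUID. On a system that has it (later illumos/Solaris builds), the pool-level part of what sridhar asks for is a one-liner, though it does not rewrite the per-vdev GUIDs Phil mentions for the same-host clone case:

```shell
# Assign a fresh random GUID to an imported pool
# (only on ZFS versions that support the reguid subcommand):
zpool reguid smpool

# Confirm the pool now carries a new GUID:
zpool get guid smpool
```

If your ZFS version predates reguid, the zdb-hack approach Phil describes is the only route, and it remains unsupported.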
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On 11/15/10 19:36, David Magda wrote:

> On Mon, November 15, 2010 14:14, Darren J Moffat wrote:
>> Today Oracle Solaris 11 Express was released and is available for download[1]. This release includes on-disk encryption support for ZFS. Using ZFS encryption support can be as easy as this:
>>   # zfs create -o encryption=on tank/darren
>>   Enter passphrase for 'tank/darren':
>>   Enter again:
>
> Looking forward to playing with it. Some questions:
> 1. Is it possible to do a 'zfs create -o encryption=off tank/darren/music' after the above command? I don't much care if my MP3s are encrypted. :)

No, all child filesystems must be encrypted as well. This is to avoid problems with mounting during boot / pool import. It is possible this could be relaxed in the future, but it is highly dependent on some other things that may not work out.

> 2. Both CCM and GCM modes of operation are supported: can you recommend which mode should be used when? I'm guessing it's best to accept the default if you're not sure, but what if we want to expand our knowledge?

You've preempted my next planned posting ;-) But I'll attempt to give an answer here:

'on' maps to aes-128-ccm because it is the fastest of the 6 available modes of encryption currently provided. Also, I believe it is the current wisdom of cryptographers (which I do not claim to be) that AES-128 is the preferred key length, due to recent discoveries about AES-256 that are not known to impact AES-128.

Both CCM[1] and GCM[2] are provided so that if one turns out to have flaws, hopefully the other will still be available for safe use, even though they are roughly similar styles of modes. On systems without hardware/CPU support for Galois-field multiplication (Intel Westmere and later, and SPARC T3 and later), GCM will be slower because the Galois field multiplication has to happen in software without any hardware/CPU assist. However, depending on your workload, you might not even notice the difference.
One reason you may want to select aes-128-gcm rather than aes-128-ccm is that GCM is one of the modes for AES in NSA Suite B[3], but CCM is not.

Are there symmetric algorithms other than AES that are of interest? The wrapping key algorithm currently matches the data encryption key algorithm. Is there interest in providing different wrapping key algorithms, and configuration properties for selecting which one -- for example, doing key wrapping with an RSA keypair/certificate?

[1] http://en.wikipedia.org/wiki/CCM_mode
[2] http://en.wikipedia.org/wiki/Galois/Counter_Mode
[3] http://en.wikipedia.org/wiki/NSA_Suite_B_Cryptography

-- Darren J Moffat
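Pulling the thread's advice together as a sketch (the dataset names are illustrative; the property values are the modes Darren names):

```shell
# Default: 'on' maps to aes-128-ccm, the fastest of the six modes.
zfs create -o encryption=on tank/darren

# Explicitly choose GCM, e.g. for Suite B alignment, at the cost of
# software Galois multiplication on pre-Westmere / pre-T3 hardware:
zfs create -o encryption=aes-128-gcm tank/secure

# Children inherit encryption and cannot opt out (per the answer to
# question 1 above):
zfs create tank/secure/music
zfs get encryption tank/secure/music
```

Note the mode is fixed at dataset creation; picking CCM vs GCM is a create-time decision, not a tunable you can flip later.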
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On Nov 15, 2010, at 14:36, David Magda wrote:

> Looking forward to playing with it. Some questions:
> 1. Is it possible to do a 'zfs create -o encryption=off tank/darren/music' after the above command? I don't much care if my MP3s are encrypted. :)
> 2. Both CCM and GCM modes of operation are supported: can you recommend which mode should be used when? I'm guessing it's best to accept the default if you're not sure, but what if we want to expand our knowledge?

For (2), just posted: http://blogs.sun.com/darren/entry/choosing_a_value_for_the
Re: [zfs-discuss] Pool versions
Hi Ian,

The pool and file system version information is available in the ZFS Administration Guide, here: http://docs.sun.com/app/docs/doc/821-1448/appendixa-1?l=en&a=view

The OpenSolaris version pages are up to date now also.

Thanks, Cindy

On 11/15/10 16:42, Ian Collins wrote:

> Is there an up-to-date reference, following on from http://hub.opensolaris.org/bin/view/Community+Group+zfs/24, listing what's in the zpool versions up to the current 31?
[zfs-discuss] zpool import is this safe to use -f option in this case ?
Hi,

I have done the following (which is required for my case):
1. Created a zpool (smpool) on a device/LUN from an array (IBM 6K) on host1.
2. Created an array-level snapshot of the device using dscli to another device, which was successful.
3. Made the snapshot device visible to another host (host2) and tried 'zpool import smpool'. I got a warning message that host1 is using this pool (the smpool metadata may have stored this info), and was asked to use -f.

When I tried zpool import with the -f option, I was able to successfully import on host2 and access all file systems and snapshots.

My query is: in this scenario, is it always safe to use -f to import? Would there be any issues?

Also, I have observed that zpool import took some time to complete successfully. Is there a way to minimize the zpool import -f operation time?

Regards, sridhar.
-- This message posted from opensolaris.org
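A hedged sketch of a flow that avoids the -f warning altogether (pool and host names are from the post; the array-snapshot step is whatever dscli invocation applies). The point is that exporting before the snapshot leaves the on-disk label recording a clean detach, so the clone does not claim to be in use by host1:

```shell
# On host1: export before the array-level snapshot so the pool's
# label no longer marks it active on host1.
zpool export smpool

# ... take the array snapshot of the LUN here (e.g. via dscli) ...

# Resume use on host1.
zpool import smpool

# On host2: the snapshot LUN now imports without needing -f.
zpool import smpool
```

If quiescing host1 is not possible, -f on host2 works, as observed, but the import then relies on the hardware snapshot being crash-consistent across all LUNs in the pool.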
Re: [zfs-discuss] Pool versions
On 11/17/10 05:45 AM, Cindy Swearingen wrote:

> Hi Ian,
> The pool and file system version information is available in the ZFS Administration Guide, here: http://docs.sun.com/app/docs/doc/821-1448/appendixa-1?l=en&a=view
> The OpenSolaris version pages are up to date now also.

Thanks Cindy!
-- Ian.
Re: [zfs-discuss] cannot delete file when fs 100% full
Hi. I ran into that damn problem too. And after days of searching I finally found this software: Delete Long Path File Tool. It's GREAT. You can find it here: www.deletelongfile.com
-- This message posted from opensolaris.org
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
>>>>> "tc" == Tim Cook t...@cook.ms writes:

    tc> Channeling Ethernet will not make it any faster. Each
    tc> individual connection will be limited to 1gbit. iSCSI with
    tc> mpxio may work, nfs will not.

well...probably you will run into this problem, but it's not necessarily totally unsolved. I am just regurgitating this list again, but:

need to include the L4 port number in the hash:
http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb
  port-channel load-balance mixed   -- for L2 etherchannels
  mls ip cef load-sharing full      -- for L3 routing (OSPF ECMP)

nexus makes all this more complicated. there are a few ways that seem they'd be able to accomplish ECMP:
 * FTag flow markers in ``FabricPath'' L2 forwarding
 * LISP
 * MPLS

the basic scheme is that the L4 hash is performed only by the edge router and used to calculate a label. The routing protocol will either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of per-entire-path ECMP for LISP and MPLS. unfortunately I don't understand these tools well enough to lead you further, but if you're not using infiniband and want to do 10-way ECMP this is probably where you need to look.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
  feature added in snv_117: NFS client connections can be spread over multiple TCP connections when rpcmod:clnt_max_conns is set to a value > 1. however, ``Even though the server is free to return data on different connections, [it does not seem to choose to actually do so]'' -- 6696163, fixed in snv_117.

nfs:nfs3_max_threads=32 in /etc/system changes the default 8 async threads per mount to 32. This is especially helpful for NFS over 10Gb and sun4v.

this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish. From there, you still need to do the cisco/??? stuff above to get TCP circuits spread across physical paths.
http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
  -- suspect. it advises ``just buy 10gig'', but many other places say 10G NICs don't perform well in real multi-core machines unless you have at least as many TCP streams as cores, which is honestly kind of obvious. lego-netadmin bias.
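The two Solaris-side NFS tunables mentioned above can be sketched as an /etc/system fragment (a config sketch, untested here; the clnt_max_conns value of 8 is an illustrative assumption -- the post only says it must be greater than 1 -- and a reboot is needed for /etc/system changes to take effect):

```shell
# /etc/system fragment

# Spread NFS client RPC traffic over multiple TCP connections
# (useful from snv_117, when the server-side fix 6696163 landed):
set rpcmod:clnt_max_conns = 8

# Raise per-mount async NFSv3 threads from the default 8,
# helpful for NFS over 10Gb and on sun4v:
set nfs:nfs3_max_threads = 32
```

Multiple TCP connections matter because the switch's link-aggregation hash works per flow; one fat TCP stream can never be spread across channel members.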
Re: [zfs-discuss] Ideas for ghetto file server data reliability?
Ummm... there's a difference between data integrity and data corruption. Integrity is enforced programmatically by something like a DBMS: it sets up basic rules that ensure the programmer, program, or algorithm adheres to a level of sanity and bounds. Corruption is where cosmic rays, bit rot, malware, or some other agent writes at the block level. ZFS protects systems from a lot of this by the way it's constructed to keep metadata, checksums, and duplicates of critical data. If the filesystem is given bad data, it will faithfully lay it down on disk. If that data later gets corrupted on disk, ZFS will come in and save the day.

Regards, Mike

On Nov 16, 2010, at 11:28, Edward Ned Harvey sh...@nedharvey.com wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Toby Thain
>> The corruption will at least be detected by a scrub, even in cases where it cannot be repaired.
>
> Not necessarily. Let's suppose you have some bad memory, and no ECC. Your application does 1 + 1 = 3. Then your application writes the answer to a file. Without ECC, the corruption happened in memory and went undetected. Then the corruption was written to the file, with a correct checksum. So in fact it's not filesystem corruption, and ZFS will correctly mark the filesystem as clean and free of checksum errors.
>
> In conclusion: Use ECC if you care about your data. Do backups if you care about your data. Don't be a cheapskate, or else don't complain when you get bitten by lack of adequate data protection.
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
On Wed, Nov 17, 2010 at 7:56 AM, Miles Nordin car...@ivy.net wrote:

> [full message quoted earlier in the thread -- snipped]
> ...this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish. From there, you still need to do the cisco/??? stuff above to get TCP circuits spread across physical paths.

AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.

--Tim
[zfs-discuss] Any opinions on the Brocade 825 Dual port 8Gb FC HBA?
Does OpenSolaris / Solaris 11 Express have a driver for it already? Anyone used one already?
-Kyle
[zfs-discuss] Adding Sun Flash Accelerator F20's into a Zpool for Optimal Performance [SEC=UNCLASSIFIED]
Zfs Gods,

I have been approved to buy 2 x F20 PCIe cards for my X4540 to increase our IOPS, and I was wondering what would be of most benefit to gain extra IOPS (both reading and writing) on my zpool. Currently I have the following storage zpool, called cesspool:

  pool: cesspool
 state: ONLINE
 scrub: scrub completed after 14h0m with 0 errors on Sat Nov 13 18:11:29 2010
config:

        NAME           STATE   READ WRITE CKSUM
        cesspool       ONLINE     0     0     0
          raidz2-0     ONLINE     0     0     0
            c10t0d0p0  ONLINE     0     0     0
            c11t0d0p0  ONLINE     0     0     0
            c12t0d0p0  ONLINE     0     0     0
            c13t0d0p0  ONLINE     0     0     0
            c8t1d0p0   ONLINE     0     0     0
            c9t1d0p0   ONLINE     0     0     0
            c10t1d0p0  ONLINE     0     0     0
            c11t1d0p0  ONLINE     0     0     0
            c12t1d0p0  ONLINE     0     0     0
            c13t1d0p0  ONLINE     0     0     0
            c8t2d0p0   ONLINE     0     0     0
          raidz2-1     ONLINE     0     0     0
            c9t2d0p0   ONLINE     0     0     0
            c10t2d0p0  ONLINE     0     0     0
            c11t2d0p0  ONLINE     0     0     0
            c12t2d0p0  ONLINE     0     0     0
            c13t2d0p0  ONLINE     0     0     0
            c8t3d0p0   ONLINE     0     0     0
            c9t3d0p0   ONLINE     0     0     0
            c10t3d0p0  ONLINE     0     0     0
            c11t3d0p0  ONLINE     0     0     0
            c12t3d0p0  ONLINE     0     0     0
            c13t3d0p0  ONLINE     0     0     0
          raidz2-2     ONLINE     0     0     0
            c8t4d0p0   ONLINE     0     0     0
            c9t4d0p0   ONLINE     0     0     0
            c10t4d0p0  ONLINE     0     0     0
            c11t4d0p0  ONLINE     0     0     0
            c12t4d0p0  ONLINE     0     0     0
            c13t4d0p0  ONLINE     0     0     0
            c8t5d0p0   ONLINE     0     0     0
            c9t5d0p0   ONLINE     0     0     0
            c10t5d0p0  ONLINE     0     0     0
            c11t5d0p0  ONLINE     0     0     0
            c12t5d0p0  ONLINE     0     0     0
          raidz2-3     ONLINE     0     0     0
            c13t5d0p0  ONLINE     0     0     0
            c8t6d0p0   ONLINE     0     0     0
            c9t6d0p0   ONLINE     0     0     0
            c10t6d0p0  ONLINE     0     0     0
            c11t6d0p0  ONLINE     0     0     0
            c12t7d0p0  ONLINE     0     0     0
            c13t6d0p0  ONLINE     0     0     0
            c8t7d0p0   ONLINE     0     0     0
            c9t7d0p0   ONLINE     0     0     0
            c10t7d0p0  ONLINE     0     0     0
            c11t7d0p0  ONLINE     0     0     0
        spares
          c12t6d0p0    AVAIL
          c13t7d0p0    AVAIL

As you would imagine with that setup, its IOPS are nothing to write home about. No slog or cache devices. If I get 2 x F20 PCIe cards, how would you recommend I use them for most benefit? I was thinking of partitioning the devices that show up from the F20s: have a mirrored slog (named slogger) and use the other two vdevs as cache devices in cesspool.
Or am I better off using one device solely for the slog and one as a cache device in my pool (cesspool)? Because if I lose the cache device the pool still operates, just slower, correct? (Am I right there?)

I am getting an outage on the prod system in December, but I will test them in my backup X4500 (if I can) before the cutover on the prod system. I will also be looking to go to the latest firmware and possibly (depending on costs -- awaiting a quote from our Sun/Oracle supplier) Solaris 11 Express...

Thanks, will appreciate any thoughts.

--
Cooper Ry Lees
HPC / UNIX Systems Administrator - Information Management Services (IMS)
Australian Nuclear Science and Technology Organisation
T +61 2 9717 3853  F +61 2 9717 9273  M +61 403 739 446
E cooper.l...@ansto.gov.au
www.ansto.gov.au
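The layout being weighed above can be sketched with zpool commands. This is a hedged illustration only: the F20 device names (c14*/c15*) are assumptions, and each F20 is assumed to expose four flash-module LUNs. Mirroring the slog across the two cards protects in-flight sync writes from a single-card failure; cache devices need no redundancy, since losing an L2ARC device only costs performance, as suspected above:

```shell
# One FMod from each F20 card, mirrored, as the separate intent log
# (a log vdev in cesspool, not a separate pool):
zpool add cesspool log mirror c14t0d0 c15t0d0

# The remaining FMods as L2ARC cache devices:
zpool add cesspool cache c14t1d0 c14t2d0 c14t3d0 \
                         c15t1d0 c15t2d0 c15t3d0
```

Note the slog helps synchronous write IOPS only (NFS, databases), while the cache devices help the read side; a mix of both is why splitting the FMods is attractive.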
Re: [zfs-discuss] zpool import is this safe to use -f option in this case ?
sridhar,

> I have done the following (which is required for my case): Created a zpool (smpool) on a device/LUN from an array (IBM 6K) on host1; created an array-level snapshot of the device using dscli to another device, which was successful. Now I make the snapshot device visible to another host (host2).

Even though the array is capable of taking device/LUN snapshots, this is a non-standard mode of operation regarding the use of ZFS. It raises the concern that if one had a problem using ZFS in this manner, there would be few Oracle or community users of ZFS who could assist. Even if an alleged problem was not related to using ZFS with array-based snapshots, such usage would always create a level of uncertainty. Instead, I would suggest using ZFS send / recv.

> I tried zpool import smpool. Got a warning message that host1 is using this pool (the smpool metadata may have stored this info) and was asked to use -f. When I tried zpool import with the -f option, I was able to successfully import on host2 and access all file systems and snapshots. My query is: in this scenario, is it always safe to use -f to import?

In this scenario, it is safe to use -f with zpool import.

> Would there be any issues?

Prior to taking the next snapshot, one must be assured that the device/LUN on host2 is returned to the zpool-exported state. Failure to do this could cause zpool corruption, ZFS I/O failures, or even the possibility of a system panic on host2.

> Also, I have observed that zpool import took some time to complete successfully. Is there a way to minimize the zpool import -f operation time?

No.

- Jim
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote: On Wed, Nov 17, 2010 at 7:56 AM, Miles Nordin car...@ivy.net wrote: tc == Tim Cook t...@cook.ms writes: tc Channeling Ethernet will not make it any faster. Each tc individual connection will be limited to 1gbit. iSCSI with tc mpxio may work, nfs will not. well...probably you will run into this problem, but it's not necessarily totally unsolved. I am just regurgitating this list again, but: need to include L4 port number in the hash: http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb port-channel load-balance mixed -- for L2 etherchannels mls ip cef load-sharing full -- for L3 routing (OSPF ECMP) nexus makes all this more complicated. there are a few ways that seem they'd be able to accomplish ECMP: FTag flow markers in ``FabricPath'' L2 forwarding LISP MPLS the basic scheme is that the L4 hash is performed only by the edge router and used to calculate a label. The routing protocol will either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of per-entire-path ECMP for LISP and MPLS. unfortunately I don't understand these tools well enoguh to lead you further, but if you're not using infiniband and want to do 10way ECMP this is probably where you need to look. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942 feature added in snv_117, NFS client connections can be spread over multiple TCP connections When rpcmod:clnt_max_conns is set to a value 1 however Even though the server is free to return data on different connections, [it does not seem to choose to actually do so] -- 6696163 fixed snv_117 nfs:nfs3_max_threads=32 in /etc/system, which changes the default 8 async threads per mount to 32. This is especially helpful for NFS over 10Gb and sun4v this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish. From there, you still need to do the cisco/??? 
stuff above to get TCP circuits spread across physical paths. http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html -- suspect. it advises ``just buy 10gig'' but many other places say 10G NICs don't perform well in real multi-core machines unless you have at least as many TCP streams as cores, which is honestly kind of obvious. lego-netadmin bias. AFAIK, esx/i doesn't support L4 hash, so that's a non-starter. For iSCSI one just needs to have a second (third or fourth...) iSCSI session on a different IP to the target and run mpio/mpxio/mpath, whatever your OS calls multi-pathing. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
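The hashing behavior discussed above can be illustrated with a toy model (not Cisco's actual algorithm; the field choices and link count here are illustrative). With only L2/L3 fields in the hash, every packet between the same two hosts lands on the same physical link, so one big NFS transfer can never exceed one link; including the L4 ports lets separate TCP connections spread across the channel:

```python
# Toy model of etherchannel load balancing: a link is chosen by hashing
# header fields. With only L3 fields, all traffic between the same two
# hosts shares one link; adding L4 ports spreads distinct TCP
# connections (e.g. multiple NFS/iSCSI sessions) across links.

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links, use_l4=True):
    fields = (src_ip, dst_ip) + ((src_port, dst_port) if use_l4 else ())
    return hash(fields) % n_links

# Four TCP connections between the same pair of hosts:
conns = [("10.0.0.1", "10.0.0.2", sport, 2049)
         for sport in (50001, 50002, 50003, 50004)]

l3_links = {pick_link(*c, n_links=4, use_l4=False) for c in conns}
l4_links = {pick_link(*c, n_links=4, use_l4=True) for c in conns}

print(len(l3_links))  # 1: without L4, every connection shares one link
print(len(l4_links))  # usually >1: connections spread across links
```

This is why the multiple-TCP-connection NFS feature (6817942) and multiple iSCSI sessions help: they give the L4-aware hash distinct flows to spread.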
Re: [zfs-discuss] zpool import is this safe to use -f option in this case ?
On Wed, Nov 17, 2010 at 10:12 AM, Jim Dunham james.dun...@oracle.com wrote: sridhar, I have done the following (which is required for my case): Created a zpool (smpool) on a device/LUN from an array (IBM 6K) on host1. Created an array-level snapshot of the device using dscli to another device, which was successful. Now I make the snapshot device visible to another host (host2). Even though the array is capable of taking device/LUN snapshots, this is a non-standard mode of operation regarding the use of ZFS. It raises concerns that if one had a problem using ZFS in this manner, there would be few Oracle or community users of ZFS that could assist. Even if the alleged problem was not related to using ZFS with array-based snapshots, usage would always create a level of uncertainty. I would suggest using ZFS send / recv instead. That's what we call FUD. It might be a problem if you use someone else's feature that we duplicate. If Oracle isn't going to support array-based snapshots, come right out and say it. You might as well pack up the cart now though, there isn't an enterprise array on the market that doesn't have snapshots, and you will be the ONLY OS I've ever heard of even suggesting that array-based snapshots aren't allowed. Would there be any issues? Prior to taking the next snapshot, one must be assured that the device/LUN on host2 is returned to the zpool exported state. Failure to do this could cause zpool corruption, ZFS I/O failures, or even the possibility of a system panic on host2. Really? And how did you come to that conclusion? OP: Yes, you do need to use a -f. The zpool has a signature that is there when the pool is imported (this is to keep an admin from accidentally importing the pool on two different systems at the same time). The only way to clear it is to do a zpool export before taking the initial snapshot, or doing the -f on import. Jim here is doing a great job of spreading FUD, and none of it is true. 
What you're doing should absolutely work, just make sure there is no I/O in flight when you take the original snapshot. Either export the pool first (I would recommend this approach), shut the system down, or just make sure you aren't doing any writes when taking the array-based snapshot. --Tim
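The workflow Tim describes might be sketched as follows. This is a hedged outline, not a tested procedure: the pool name comes from the thread, but the sequencing is an assumption and the array-snapshot step is whatever your array CLI (e.g. dscli) provides.

```shell
# On host1: export first so no I/O is in flight and the in-use
# signature is cleared (the safest of the options discussed above)
zpool export smpool

# ...take the array-level snapshot here, e.g. via dscli...

# Re-import on host1 and resume normal use
zpool import smpool

# On host2, once the snapshot LUN is visible there: -f overrides the
# "pool was last in use by another system" check left by host1's import
zpool import -f smpool
```

If the pool was exported before the snapshot was taken, the `-f` on host2 should not be needed; it is only required when the snapshot captured an imported (in-use) pool.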
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
On Nov 16, 2010, at 6:37 PM, Ross Walker wrote: On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote: AFAIK, esx/i doesn't support L4 hash, so that's a non-starter. For iSCSI one just needs to have a second (third or fourth...) iSCSI session on a different IP to the target and run mpio/mpxio/mpath, whatever your OS calls multi-pathing. MC/S (Multiple Connections per Session) support was added to the iSCSI Target in COMSTAR, now available in Oracle Solaris 11 Express. - Jim -Ross
Re: [zfs-discuss] Adding Sun Flash Accelerator F20's into a Zpool for Optimal Performance [SEC=UNCLASSIFIED]
On Wed, 17 Nov 2010, LEES, Cooper wrote: Zfs Gods, I have been approved to buy 2 x F20 PCIe cards for my x4540 to increase our IOPs and I was wondering what would be the most benefit to gain extra IOPs (both reading and writing) on my zpool. To clarify, adding a dedicated intent log (slog) only improves apparent IOPS for synchronous writes such as via NFS or a database. It will not help async writes at all unless they are contending with sync writes. An l2arc device will help with read IOPS quite a lot provided that the working set is larger than system RAM yet smaller than the l2arc device. If the working set is still much larger than RAM plus l2arc devices, then read performance may still be bottlenecked by disk. Take care not to trade IOPS gains for a data-rate throughput loss. Sometimes cache devices offer less throughput than main store. There is little doubt that your pool would support more IOPS if it were based on more vdevs, containing fewer drives each. I doubt that anyone here can adequately answer your question without measurement data from the system taken while it is under the expected load. Useful tools for producing data to look at are the zilstat.ksh and arc_summary.pl scripts, which you should find mentioned in the list archives. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
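For concreteness, the slog/L2ARC split Bob describes might look like the following. This is a sketch, not a recommendation for the OP's hardware: the pool name and device names are hypothetical, and how the F20's flash modules enumerate as devices will vary.

```shell
# Small mirrored slog for synchronous writes (NFS, databases);
# mirroring the log protects sync-write semantics if one device dies.
zpool add tank log mirror c1t0d0 c1t1d0

# Larger L2ARC for reads; cache devices need no redundancy since
# a failed read from cache just falls back to the main pool.
zpool add tank cache c1t2d0 c1t3d0

# Measure before and after, as suggested above:
#   zilstat.ksh      - ZIL (sync write) activity
#   arc_summary.pl   - ARC/L2ARC sizes and hit rates
```

Per Bob's point, the slog only helps if zilstat shows real synchronous write traffic, and the cache devices only help if the working set actually overflows RAM.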
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
On Nov 16, 2010, at 7:49 PM, Jim Dunham james.dun...@oracle.com wrote: On Nov 16, 2010, at 6:37 PM, Ross Walker wrote: On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote: AFAIK, esx/i doesn't support L4 hash, so that's a non-starter. For iSCSI one just needs to have a second (third or fourth...) iSCSI session on a different IP to the target and run mpio/mpxio/mpath, whatever your OS calls multi-pathing. MC/S (Multiple Connections per Session) support was added to the iSCSI Target in COMSTAR, now available in Oracle Solaris 11 Express. Good to know. The only initiator I know of that supports that is Windows, but with MC/S one at least doesn't need MPIO as the initiator handles the multiplexing over the multiple connections itself. Doing multiple sessions and MPIO is supported almost universally though. -Ross
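On a Solaris initiator, the multiple-sessions-plus-multipathing approach described above might be set up roughly as follows. This is a hedged sketch: the target IQN and portal addresses are hypothetical, and exact option syntax should be checked against the iscsiadm(1M) man page on your release.

```shell
# Enable MPxIO so multiple paths collapse into one scsi_vhci device
# (requires a reboot)
stmsboot -e

# Discover the target via two portals on separate subnets/links
iscsiadm add discovery-address 192.168.10.1
iscsiadm add discovery-address 192.168.20.1
iscsiadm modify discovery --sendtargets enable

# Ask the initiator for two sessions to the target (MS/T),
# giving the L4-aware network layer two TCP flows to spread
iscsiadm modify target-param -c 2 iqn.2010-11.org.example:target0

# Confirm both paths show up under a single multipathed LUN
mpathadm list lu
```

This is the MS/T (multiple sessions per target) model Ross describes, which works with nearly any initiator/target pair, as opposed to MC/S, which multiplexes connections inside one session.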
Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
I've done mpxio over multiple IP links in Linux using multipathd. Works just fine. It's not part of the initiator but accomplishes the same thing. It was a Linux IET target. Need to try it here with a COMSTAR target. -Original Message- From: Ross Walker rswwal...@gmail.com Sender: zfs-discuss-boun...@opensolaris.org Date: Tue, 16 Nov 2010 22:05:05 To: Jim Dunham james.dun...@oracle.com Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS
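The Linux setup described above (open-iscsi plus multipathd) might look like this sketch; the target IQN and portal addresses are hypothetical, and the target could be IET or COMSTAR.

```shell
# Log in to the same target twice, once per portal/link, creating
# two independent iSCSI sessions (and two block devices)
iscsiadm -m node -T iqn.2010-11.org.example:target0 -p 192.168.10.1 --login
iscsiadm -m node -T iqn.2010-11.org.example:target0 -p 192.168.20.1 --login

# multipathd coalesces the duplicate devices into one dm-multipath
# map; this should show one LU with two active paths
multipath -ll
```

As the poster notes, this lives outside the initiator (device-mapper does the path aggregation) but achieves the same effect as MC/S or MPxIO.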
Re: [zfs-discuss] zpool import is this safe to use -f option in this case ?
Tim, On Wed, Nov 17, 2010 at 10:12 AM, Jim Dunham james.dun...@oracle.com wrote: sridhar, I have done the following (which is required for my case): Created a zpool (smpool) on a device/LUN from an array (IBM 6K) on host1. Created an array-level snapshot of the device using dscli to another device, which was successful. Now I make the snapshot device visible to another host (host2). Even though the array is capable of taking device/LUN snapshots, this is a non-standard mode of operation regarding the use of ZFS. It raises concerns that if one had a problem using ZFS in this manner, there would be few Oracle or community users of ZFS that could assist. Even if the alleged problem was not related to using ZFS with array-based snapshots, usage would always create a level of uncertainty. I would suggest using ZFS send / recv instead. That's what we call FUD. It might be a problem if you use someone else's feature that we duplicate. If Oracle isn't going to support array-based snapshots, come right out and say it. You might as well pack up the cart now though, there isn't an enterprise array on the market that doesn't have snapshots, and you will be the ONLY OS I've ever heard of even suggesting that array-based snapshots aren't allowed. That's not what I said... Non-standard mode of operation is not the same thing as not supported. Using ZFS's standard mode of operation, based on its built-in support for snapshots, is well-proven, well-documented technology. Would there be any issues? Prior to taking the next snapshot, one must be assured that the device/LUN on host2 is returned to the zpool exported state. Failure to do this could cause zpool corruption, ZFS I/O failures, or even the possibility of a system panic on host2. Really? And how did you come to that conclusion? As a prior developer and project lead of host-based snapshot and replication software on Solaris, I have first-hand experience using ZFS with snapshots. 
If, while ZFS on node2 is accessing an instance of snapshot data, the array updates the snapshot data, ZFS will see newly created CRCs written by node1. These CRCs will be considered metadata corruption, and depending on exactly what ZFS was doing at the time the corruption was detected, the software will attempt some form of error recovery. OP: Yes, you do need to use a -f. The zpool has a signature that is there when the pool is imported (this is to keep an admin from accidentally importing the pool on two different systems at the same time). The only way to clear it is to do a zpool export before taking the initial snapshot, or doing the -f on import. Jim here is doing a great job of spreading FUD, and none of it is true. What you're doing should absolutely work, just make sure there is no I/O in flight when you take the original snapshot. Either export the pool first (I would recommend this approach), shut the system down, or just make sure you aren't doing any writes when taking the array-based snapshot. These last two statements need clarification. ZFS is always on-disk consistent, even in the context of using snapshots. Therefore, as far as ZFS is concerned, there is no need to assure that there are no I/Os in flight, or that the storage pool is exported, or that the system is shut down, or that no writes are occurring. Although ZFS is always on-disk consistent, many applications are not filesystem consistent. To be filesystem consistent, an application by design must issue careful writes and/or synchronized filesystem operations. Not knowing this fact, or lacking this functionality, a system admin will need to deploy some of the work-arounds suggested above. The most important one not listed is to stop or pause those applications which are known not to be filesystem consistent. - Jim --Tim
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
Darren J Moffat darr...@opensolaris.org writes: On 11/15/10 19:36, David Magda wrote: Using ZFS encryption support can be as easy as this: # zfs create -o encryption=on tank/darren Enter passphrase for 'tank/darren': Enter again: 2. Both CCM and GCM modes of operation are supported: can you recommend which mode should be used when? I'm guessing it's best to accept the default if you're not sure, but what if we want to expand our knowledge? You've preempted my next planned posting ;-) But I'll attempt to give an answer here: 'on' maps to aes-128-ccm, because it is the fastest of the 6 available modes of encryption currently provided. Also I believe it is the current wisdom of cryptographers (which I do not claim to be) that AES-128 is the preferred key length, due to recent discoveries about AES-256 that are not known to impact AES-128. Both CCM[1] and GCM[2] are provided so that if one turns out to have flaws, hopefully the other will still be available for use safely, even though they are roughly similar styles of modes. On systems without hardware/cpu support for Galois multiplication (Intel Westmere and later, and SPARC T3 and later), GCM will be slower because the Galois field multiplication has to happen in software without any hardware/cpu assist. However, depending on your workload you might not even notice the difference. One reason you may want to select aes-128-gcm rather than aes-128-ccm is that GCM is one of the modes for AES in NSA Suite B[3], but CCM is not. Are there symmetric algorithms other than AES that are of interest? The wrapping key algorithm currently matches the data encryption key algorithm; is there interest in providing different wrapping key algorithms and configuration properties for selecting which one? For example, doing key wrapping with an RSA keypair/certificate? 
[1] http://en.wikipedia.org/wiki/CCM_mode [2] http://en.wikipedia.org/wiki/Galois/Counter_Mode [3] http://en.wikipedia.org/wiki/NSA_Suite_B_Cryptography I appreciate all the hard work the ZFS team and yourself have done to make this happen. I think a lot of people are going to give this a try, but I noticed that one of the license restrictions was not to run benchmarks without prior permission from Oracle. Is Oracle going to post some benchmarks that might give people an idea of the performance using the various key lengths? Or even the performance benefit of using the newer processors with hardware support? I think a few graphs and testing procedures would be great; this might be an opportunity to convince people of the benefit of using SPARC and Oracle hardware, while at the same time giving people a basic idea what it could do for them on their own systems. I would also go as far as saying that some people would not even know how to set up a baseline to get comparative test results while using encryption. I can imagine a lot of people are curious about every aspect of performance and are wondering whether ZFS encryption is ready for production. I just think that some people might need that little extra nudge that a few graphs and tests would provide. If it happens to also come with a few good practices, you could save a lot of people some time and heartache, as I am sure people are eager to see the results. Rthoreau
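Putting Darren's mode guidance into commands, selecting a specific mode is just a matter of naming it in the encryption property at dataset creation (dataset names here are hypothetical):

```shell
# 'on' maps to the default, aes-128-ccm - the fastest of the six modes
zfs create -o encryption=on tank/general

# Choose GCM explicitly, e.g. for NSA Suite B alignment; slower on
# CPUs without Galois-multiply hardware (pre-Westmere, pre-T3)
zfs create -o encryption=aes-128-gcm tank/suiteb

# Longer keys are available in both modes, at some cost in speed
zfs create -o encryption=aes-256-ccm tank/paranoid
```

Note that, as the thread discusses, the encryption property can only be set when the dataset is created; it cannot be switched on later for an existing dataset.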
Re: [zfs-discuss] HP ProLiant N36L
I can now confirm that NexentaCore runs without a hitch on the N36L.