Re: [zfs-discuss] zfs data corruption
Just to clarify this post: this isn't data I care about recovering. I'm just interested in understanding how ZFS determined there was data corruption when I have checksums disabled and there were no non-retryable read errors reported in the messages file.

On Wed, Apr 23, 2008 at 9:52 PM, Victor Engle [EMAIL PROTECTED] wrote:

Thanks! That would explain things. I don't believe it was a real disk read error because of the absence of evidence in /var/adm/messages. I'll review the man page and documentation to confirm that metadata is checksummed.

Regards,
Vic

On Wed, Apr 23, 2008 at 6:30 PM, Nathan Kroenert [EMAIL PROTECTED] wrote:

I'm just taking a stab here, so could be completely wrong, but IIRC, even if you disable checksum, it still checksums the metadata... So, it could be metadata checksum errors. Others on the list might have some funky zdb thingies you could use to see what it actually is... Note: typed pre-caffeine... :)

Nathan

Vic Engle wrote:

I'm hoping someone can help me understand a ZFS data corruption symptom. We have a zpool with checksum turned off. zpool status shows that data corruption occurred. The application using the pool at the time reported a read error, and zpool status (see below) shows 2 read errors on a device. The thing that is confusing to me is how ZFS determines that data corruption exists when reading data from a pool with checksum turned off. Also, I'm wondering about the persistent errors in the output below. Since no specific file or directory is mentioned, does this indicate pool metadata is corrupt? Thanks for any help interpreting the output...

# zpool status -xv
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                                     STATE   READ WRITE CKSUM
        zpool1                                   ONLINE     2     0     0
          c4t60A9800043346859444A476B2D48446Fd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D484352d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D484236d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D482D6Cd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483951d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483836d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48366Bd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483551d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483435d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48326Bd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483150d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483035d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47796Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477850d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477734d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47756Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47744Fd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477333d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477169d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47704Ed0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476F33d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476D68d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476C4Ed0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476B32d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476968d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D453974d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D454142d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D454255d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D45436Dd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D487346d0  ONLINE     2     0     0
          c4t60A9800043346859444A476B2D487175d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48705Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486F45d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486D74d0  ONLINE     0     0
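For anyone hitting the same question later, a quick sketch of how to confirm what is checksummed and where the reported damage lives (pool name taken from the output above; the zdb line is an assumption, since zdb options vary between builds):

    zfs get checksum zpool1    # shows checksum=off for file data; metadata is still checksummed by ZFS
    zpool status -v zpool1     # permanent-error entries of the form <metadata>:<0x...> point at pool metadata
    zpool scrub zpool1         # re-reads everything and refreshes the error counts
    zdb -c zpool1              # (assumption) traverses the pool and verifies block checksums
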
Re: [zfs-discuss] zfs data corruption
Thanks! That would explain things. I don't believe it was a real disk read error because of the absence of evidence in /var/adm/messages. I'll review the man page and documentation to confirm that metadata is checksummed.

Regards,
Vic

On Wed, Apr 23, 2008 at 6:30 PM, Nathan Kroenert [EMAIL PROTECTED] wrote:

I'm just taking a stab here, so could be completely wrong, but IIRC, even if you disable checksum, it still checksums the metadata... So, it could be metadata checksum errors. Others on the list might have some funky zdb thingies you could use to see what it actually is... Note: typed pre-caffeine... :)

Nathan

Vic Engle wrote:

I'm hoping someone can help me understand a ZFS data corruption symptom. We have a zpool with checksum turned off. zpool status shows that data corruption occurred. The application using the pool at the time reported a read error, and zpool status (see below) shows 2 read errors on a device. The thing that is confusing to me is how ZFS determines that data corruption exists when reading data from a pool with checksum turned off. Also, I'm wondering about the persistent errors in the output below. Since no specific file or directory is mentioned, does this indicate pool metadata is corrupt? Thanks for any help interpreting the output...

# zpool status -xv
  pool: zpool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                                     STATE   READ WRITE CKSUM
        zpool1                                   ONLINE     2     0     0
          c4t60A9800043346859444A476B2D48446Fd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D484352d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D484236d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D482D6Cd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483951d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483836d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48366Bd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483551d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483435d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48326Bd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483150d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D483035d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47796Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477850d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477734d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47756Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47744Fd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477333d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D477169d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D47704Ed0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476F33d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476D68d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476C4Ed0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476B32d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D476968d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D453974d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D454142d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D454255d0  ONLINE     0     0     0
          c4t60A98000433468656834476B2D45436Dd0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D487346d0  ONLINE     2     0     0
          c4t60A9800043346859444A476B2D487175d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D48705Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486F45d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486D74d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486C5Ad0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486B44d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486974d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486859d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486744d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486573d0  ONLINE     0     0     0
          c4t60A9800043346859444A476B2D486459d0  ONLINE     0     0
Re: [zfs-discuss] ZFS and multipath with iSCSI
In /kernel/drv/scsi_vhci.conf you could set:

    load-balance="none";

That way MPxIO would use only one path per device. I imagine you also need a vid/pid entry in scsi_vhci.conf for your target.

Regards,
Vic

On Fri, Apr 4, 2008 at 3:36 PM, Chris Siebenmann [EMAIL PROTECTED] wrote:

We're currently designing a ZFS fileserver environment with iSCSI-based storage (for failover, cost, ease of expansion, and so on). As part of this we would like to use multipathing for extra reliability, and I am not sure how we want to configure it.

Our iSCSI backend only supports multiple sessions per target, not multiple connections per session (and my understanding is that the Solaris initiator doesn't currently support multiple connections anyway). However, we have been cautioned that there is nothing in the backend that imposes a global ordering for commands between the sessions, and so disk I/O might get reordered if Solaris's multipath load balancing submits part of it to one session and part to another.

So: does anyone know if Solaris's multipath and iSCSI systems already take care of this, or if ZFS is already paranoid enough to deal with this, or if we should configure Solaris multipathing to not load-balance? (A load-balanced multipath configuration is simpler for us to administer, at least until I figure out how to tell Solaris multipathing which is the preferred network for any given iSCSI target so we can balance the overall network load by hand.)

Thanks in advance.
- cks
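A minimal sketch of the conf change being described. The vid/pid entry format differs between Solaris releases, so treat everything past the load-balance line as an assumption and check scsi_vhci(7D) for your build; the vendor/product strings are hypothetical and would come from the target's inquiry data:

    # /kernel/drv/scsi_vhci.conf (excerpt)
    load-balance="none";
    # hypothetical entry declaring the iSCSI target as a symmetric device
    device-type-scsi-options-list =
        "MYVENDOR MY-ISCSI-TARGET", "symmetric-option";
    symmetric-option = 0x1000000;

A reboot (or possibly update_drv -f scsi_vhci) is needed for the driver to reread its configuration.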
Re: [zfs-discuss] ZFS hang and boot hang when iSCSI device removed
I don't think this is so much a ZFS problem as an iSCSI initiator problem. Are you using static configs or SendTargets discovery? There are many reports of SendTargets discovery misbehavior in the storage-discuss forum.

To recover:

1. Boot into single user from CD
2. Mount the root slice on /a
3. rm /etc/iscsi/*
4. Reboot
5. Configure iSCSI static discovery for the new target IPs

A nice trick mentioned by David Weibel previously on storage-discuss is to use discovery addresses to provide all the info you need to create the static configs. Just add the discovery addresses but don't enable SendTargets. Then run iscsiadm list discovery-address -v. The initiator will log in to the discovery address, issue a SendTargets-all command, and print the results on stdout. Use the results to create the static configs and then enable static discovery.

Good luck,
Vic

On Feb 5, 2008 11:44 AM, Ross [EMAIL PROTECTED] wrote:

We're currently evaluating ZFS prior to (hopefully) rolling it out across our server room, and have managed to lock up a server after connecting to an iSCSI target and then changing the IP address of the target. Basically we have two test Solaris servers running, and I followed the instructions on the post below to share a zpool on Server1 using the iSCSI target, and then import that into a new zpool on Server2: http://blogs.sun.com/chrisg/date/20070418.

Everything appeared to work fine until I moved the servers to a new network (while powered on), which changed their IP addresses. The server running the iSCSI target is still fine; it has its IP address, and from another machine I can see that the iSCSI target is still visible. However, Server2 was not as happy with the move. As far as I can tell, all ZFS commands locked up on it. I couldn't run zfs list, zpool list, zpool status or zpool iostat. Every single one locked up and I couldn't even find a way to stop them. Now I've seen a few posts about ZFS commands locking up, but this is very concerning for something we're considering using in a production system.

Anyway, with Server2 well and truly locked up, I restarted it hoping that would clear the problem (figuring ZFS would simply mark the device as offline), but found that the server can't even boot. For the past hour it has simply spammed the following message to the screen:

NOTICE: iscsi connection(27) unable to connect to target iqn.1986-03.com.sun:02:3d882af1-91cc-6d9e-9f19-edfa095fca6d

Now that I wasn't expecting. This volume isn't a boot volume, the server doesn't need either ZFS or iSCSI to boot, and I don't think I even saved any data on that drive. I have found a post reporting a similar message to the above, which was reporting a ten minute boot delay with a working iSCSI volume; however, I can't find anything to say what happens if the iSCSI volume is no longer there: http://forum.java.sun.com/thread.jspa?threadID=5243777&messageID=10004717

So, I have quite a few questions:

1. Does anybody know how I can recover from this, or am I going to have to wipe my test server and start again?
2. How vulnerable are the ZFS admin tools to locking up like this?
3. How vulnerable is the iSCSI client to locking up like this during boot?
4. Is there any way we can disconnect the iSCSI share while ZFS is locked up like this? What could I have tried to regain control of my server before rebooting?
5. If I can get the server booted, is there any way to redirect an iSCSI volume so it's pointing at the new IP address? (I was expecting to simply do a zpool replace when ZFS reported the drive as missing.)

And finally, does anybody know why zpool status should lock up like this? I'm really not happy that the ZFS admin tools seem so fragile. At the very least I would have expected zpool status to be able to list the devices attached to the pools and report that they are timing out or erroring, and for me to be able to use the other ZFS tools to forcibly remove failed drives as needed. Anything less means I'm risking my whole system should ZFS find something it doesn't like. I admit I'm a Solaris newbie, but surely something designed as a robust filesystem also needs robust management tools?
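Picking up the discovery-address trick described above, a sketch of the command sequence (the portal address is a placeholder and the IQN is just the one from this thread):

    iscsiadm add discovery-address 192.168.10.50:3260    # hypothetical target portal
    iscsiadm list discovery-address -v                   # logs in and prints what SendTargets reports
    iscsiadm add static-config iqn.1986-03.com.sun:02:3d882af1-91cc-6d9e-9f19-edfa095fca6d,192.168.10.50:3260
    iscsiadm modify discovery --static enable            # leave SendTargets discovery disabled
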
Re: [zfs-discuss] Q : change disks to get bigger pool
The plan is to replace the disks with new and larger disks. So will the pool get bigger just by replacing all 4 disks one by one? And if it will get larger, how should this be done? Fail the disks one by one, or...? Or is a data backup and pool recreation the only way to get a bigger pool?

There is another possibility. If you can retain the original smaller disks in the pool, then you have the option of adding the 4 additional larger disks as another raidz set. In that case the command would be something like:

    zpool add yourpool raidz disk1 disk2 disk3 disk4

The pool would then stripe across the 2 raidz sets, with more I/O going to the larger raidz. In this case your new space would be immediately available.

Since the 4 new disks are larger, you could alternatively add a 3-disk raidz to the pool and add the 4th new disk as a spare. That way I think you could survive 2 disk failures in either raidz set, as long as the 2nd failure didn't occur during the resilver from the first failure.

Regards,
Vic
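For the original question, a sketch of the disk-by-disk replacement route (device names are made up; on ZFS releases of this era the extra capacity typically only appears after the last disk in the raidz set has been replaced and resilvered, and later releases gate this behind the autoexpand pool property):

    zpool replace yourpool c1t1d0 c2t1d0    # swap one old disk for one new, larger disk
    zpool status yourpool                   # wait for the resilver to finish before the next swap
    (repeat for the remaining disks, one at a time)
    zpool list yourpool                     # capacity reflects the larger disks once all are replaced
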
Re: [zfs-discuss] how to relocate a disk
I tried taking it offline and online again, but then zpool says the disk is unavailable. Trying a zpool replace didn't work because it complains that the new disk is part of a zfs pool...

So it would need to look like a new disk to ZFS, and not like a disk already belonging to a zpool.

Vic
Re: [zfs-discuss] how to relocate a disk
I tried taking it offline and online again, but then zpool says the disk is unavailable. Trying a zpool replace didn't work because it complains that the new disk is part of a zfs pool...

So you offlined the disk, moved it to the new controller, and then tried to add it back to the pool?

A brute-force approach might work: offline the disk, move it, run format -e to restore the VTOC label, and then zpool replace it. Of course it would have to resilver, but you would have avoided an export/import or a reboot.

Regards,
Vic
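Roughly, the sequence being suggested, with hypothetical pool and device names (relabeling wipes the existing ZFS label, so only do it to the disk you intend to re-add):

    zpool offline tank c3t2d0          # take the disk out of service in the pool
    (physically move the disk to the new controller)
    format -e                          # select the disk at its new path and write a fresh SMI/VTOC label
    zpool replace tank c3t2d0 c4t2d0   # resilver onto the relabeled disk under its new device name
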
Re: [zfs-discuss] Clearing partition/label info
Hi Al,

That depends on whether you want to go back to a VTOC/SMI label or keep the EFI label created by ZFS. To keep the EFI label, just repartition and use the partitions as desired. If you want to go back to a VTOC/SMI label, you have to run format -e, relabel the disk, and select SMI. Be sure to run zpool destroy poolname before relabeling a LUN used for ZFS.

To automatically recreate the default VTOC label you could incorporate the following into a script and iterate over a list of disks.

1. Create a /tmp/label.dat file with the following lines in it:

    label
    0
    y

2. Then execute the following format command:

    format -e -m -f /tmp/label.dat cxtxdx

That should apply a default VTOC/SMI label. For x86 you may need to run the following before the format command:

    /usr/sbin/fdisk -B cxtxdxp0

Regards,
Vic

On Dec 17, 2007 9:36 AM, Al Slater [EMAIL PROTECTED] wrote:

Hi,

What is the quickest way of clearing the label information on a disk that has been previously used in a zpool?

regards

Al Slater
Technical Director, SCL
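A rough version of those steps as a loop. The disk list is a placeholder; this rewrites labels, so double-check the device names before running, and drop the fdisk line on SPARC:

    #!/bin/sh
    # answers for format: run "label", choose 0 (SMI), confirm with y
    cat > /tmp/label.dat <<EOF
    label
    0
    y
    EOF
    for d in c2t0d0 c2t1d0 c2t2d0; do          # hypothetical disks previously used by ZFS
        /usr/sbin/fdisk -B /dev/rdsk/${d}p0    # x86 only: recreate a whole-disk fdisk partition
        format -e -m -f /tmp/label.dat $d      # apply the default SMI/VTOC label
    done

(If you copy this with the indentation shown, remove it first so the EOF heredoc terminator starts at the beginning of its line.)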
Re: [zfs-discuss] ZFS Training
This class looks pretty good...

http://www.sun.com/training/catalog/courses/SA-229-S10.xml

On 10/31/07, Lisa Richards [EMAIL PROTECTED] wrote:

Is there a class on ZFS installation and administration?

Lisa Richards
Zykis Corporation
[EMAIL PROTECTED]
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Wouldn't this be the known feature where a write error to ZFS forces a panic?

Vic

On 10/4/07, Ben Rockwood [EMAIL PROTECTED] wrote:

Dick Davies wrote:

On 04/10/2007, Nathan Kroenert [EMAIL PROTECTED] wrote:

Client A - import pool, make couple-o-changes
Client B - import pool -f (heh)

Oct 4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80:
Oct 4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5 == 0x0) , file: ../../common/fs/zfs/space_map.c, line: 339
Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice]
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 genunix:assfail3+b9 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 zfs:space_map_load+2ef ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 zfs:metaslab_activate+66 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 zfs:metaslab_group_alloc+24e ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 zfs:metaslab_alloc_dva+192 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 zfs:metaslab_alloc+82 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 zfs:zio_dva_allocate+68 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 zfs:zio_next_stage+b3 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 zfs:zio_checksum_generate+6e ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 zfs:zio_next_stage+b3 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 zfs:zio_write_compress+239 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 zfs:zio_next_stage+b3 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 zfs:zio_wait_for_children+5d ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 zfs:zio_wait_children_ready+20 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 zfs:zio_next_stage_async+bb ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 zfs:zio_nowait+11 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 zfs:dbuf_sync_leaf+1ac ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 zfs:dbuf_sync_list+51 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 zfs:dnode_sync+23b ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 zfs:dmu_objset_sync_dnodes+55 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 zfs:dmu_objset_sync+13d ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 zfs:dsl_pool_sync+199 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 zfs:spa_sync+1c5 ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 zfs:txg_sync_thread+19a ()
Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 unix:thread_start+8 ()
Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice]

Is this a known issue, already fixed in a later build, or should I bug it?

It shouldn't panic the machine, no. I'd raise a bug. After spending a little time playing with iSCSI, I have to say it's almost inevitable that someone is going to do this by accident and panic a big box for what I see as no good reason (though I'm happy to be educated... ;)

You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously access the same LUN by accident. You'd have the same problem with Fibre Channel SANs. I ran into similar problems when replicating via AVS.

benr.
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Perhaps it's the same cause, I don't know... But I'm certainly not convinced that I'd be happy with a 25K, for example, panicking just because I tried to import a dud pool... I'm OK(ish) with the panic on a failed write to non-redundant storage. I expect it by now...

I agree, forcing a panic seems pretty severe and may cause as much grief as it prevents. Why not just stop allowing I/O to the pool so the sysadmin can gracefully shut down the system? Applications would be disrupted, but no more so than they would be during a panic.
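As an aside for anyone reading this later (this property arrived in builds newer than the ones discussed in this thread, so treat its availability as an assumption for your release): the pool failmode property was added to make exactly this behavior selectable.

    zpool set failmode=wait tank    # block I/O and wait for the device instead of panicking
    zpool get failmode tank         # other accepted values are continue and panic
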
Re: [zfs-discuss] Again ZFS with expanding LUNs!
I like option #1 because it is simple and quick. It seems unlikely that this will lead to an excessive number of LUNs in the pool in most cases, unless you start with a large number of very small LUNs. If you begin with 5 100GB LUNs and over time add 5 more, it still seems like a reasonable and manageable pool with twice the original capacity. And considering the array can likely support hundreds and perhaps thousands of LUNs, it really isn't an issue on the array side either.

Regards,
Vic

On 9/12/07, Bill Korb [EMAIL PROTECTED] wrote:

I found this discussion just today as I recently set up my first S10 machine with ZFS. We use a NetApp Filer via multipathed FC HBAs, and I wanted to know what my options were in regards to growing a ZFS filesystem.

After looking at this thread, it looks like there is currently no way to grow an existing LUN on our NetApp and then tell ZFS to expand to fill the new space. This may be coming down the road at some point, but I would like to be able to do this now. At this point, I believe I have two options:

1. Add a second LUN and simply do a zpool add to add the new space to the existing pool.

2. Create a new LUN that is the size I would like my pool to be, then use zpool replace oldLUNdev newLUNdev to ask ZFS to resilver my data to the new LUN and then detach the old one.

The advantage of the first option is that it happens very quickly, but it could get kind of messy if you grow the ZFS pool on multiple occasions. I've read that some SANs are also limited as to how many LUNs can be created (some are limitations of the SAN itself, whereas I believe some others impose a limit as part of the SAN license). That would also make the first approach less attractive.

The advantage of the second approach is that all of the space would be contained in a single LUN. The disadvantages are that this would involve copying all of the data from the old LUN to the new one, and also that you need to have enough free space on your SAN to create this new, larger LUN.

Is there a best practice regarding this? I'm leaning towards option #2 so as to keep the number of LUNs I have to manage at a minimum, but #1 seems like a reasonable alternative, too. Or perhaps there's an option #3 that I haven't thought of?

Thanks,
Bill
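Sketches of the two options, with placeholder device names:

    # Option 1: add the new LUN as another top-level vdev; the extra space is available immediately
    zpool add mypool c4t<newLUN>d0

    # Option 2: migrate the pool onto a single larger LUN; ZFS resilvers onto the new device
    # and detaches the old one when the copy completes
    zpool replace mypool c4t<oldLUN>d0 c4t<newLUN>d0
    zpool status mypool
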
Re: [zfs-discuss] Single SAN Lun presented to 4 Hosts
On 8/25/07, Matt B [EMAIL PROTECTED] wrote:

The 4 database servers are part of an Oracle RAC configuration. 3 databases are hosted on these servers: BIGDB1 on all 4, littledb1 on the first 2, and littledb2 on the last two. The Oracle backup system spawns db backup jobs that could occur on any node based on traffic and load. All nodes are fiber attached to a SAN. They all have FC access to the same set of SAN disks where the nightly dumps must go. The plan all along was to save the gigE network for network traffic and have the nightly backups occur over the dedicated FC network.

Matt,

Can you just alter the backup job that Oracle spawns to import the pool, then do the backup, and finally export the pool?

Regards,
Vic
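A rough sketch of that kind of wrapper (pool name and dump command are placeholders; the critical point is that a non-clustered zpool must only ever be imported on one node at a time, so the export has to be guaranteed before another node runs the job):

    #!/bin/sh
    zpool import dumppool || exit 1       # attach the shared dump-area pool to this RAC node
    run_nightly_dump /dumppool/exports    # placeholder for the Oracle-spawned backup step
    zpool export dumppool                 # release the pool so any other node can import it later
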
Re: [zfs-discuss] Again ZFS with expanding LUNs!
I can understand LUN expansion capability being an issue with a more traditional volume manager, or even with a single LUN, but with pooled storage and the ability to expand the pool, benefiting all filesystems in the pool, it seems a shame to consider LUN expansion a show stopper. Even so, having all the details automated and transparent to the administrator would be very cool.

Regards,
Vic

On 8/7/07, George Wilson [EMAIL PROTECTED] wrote:

I'm planning on putting back the changes to ZFS into OpenSolaris in the upcoming weeks. This will still require a manual step, as the changes required in the sd driver are still under development. The ultimate plan is to have the entire process totally automated. If you have more questions, feel free to drop me a line.

Thanks,
George

Yan wrote:

Hey David, might I need to track the evolution of that size-change utility to ZFS? Could I have a contact at Sun that would be able to give me more information on that? Being able to resize LUNs dynamically is a reality here; I currently do it with UFS after an EMC Clariion LUN Migration to a larger LUN. That is our current show-stopper to using ZFS.

Thanks,
Yannick
Re: [zfs-discuss] Again ZFS with expanding LUNs!
Hi Yannick,

Just to be sure I understand the restriction: with the Clariion you are limited in host-side volume management, so that basically you use single LUNs with UFS filesystems on them, and if you need additional space in the UFS filesystem the only option is to resize the LUN on the Clariion, rewrite the VTOC on the LUN, and then growfs? That seems like a significant limitation, especially in a very dynamic, storage-centric environment.

Just curious, any idea how much performance suffers on the host if write coalescing is unusable?

Regards,
Vic

On 8/7/07, Yannick Mercier [EMAIL PROTECTED] wrote:

From a storage administrator perspective, when doing capacity planning on an EMC Clariion unit, it becomes a pain to have more than one LUN for the same volume. The Clariion with the Navisphere agent gives the storage administrator more visibility in the storage management interface, showing mountpoints on the hosts for each LUN. The EMC Clariion storage best practices recommend using one LUN per volume. The write coalescing feature may be unusable if using more than one LUN per volume when striped in ZFS.

Yannick

On 8/7/07, Victor Engle [EMAIL PROTECTED] wrote:

I can understand LUN expansion capability being an issue with a more traditional volume manager, or even with a single LUN, but with pooled storage and the ability to expand the pool, benefiting all filesystems in the pool, it seems a shame to consider LUN expansion a show stopper. Even so, having all the details automated and transparent to the administrator would be very cool.

Regards,
Vic

On 8/7/07, George Wilson [EMAIL PROTECTED] wrote:

I'm planning on putting back the changes to ZFS into OpenSolaris in the upcoming weeks. This will still require a manual step, as the changes required in the sd driver are still under development. The ultimate plan is to have the entire process totally automated. If you have more questions, feel free to drop me a line.

Thanks,
George

Yan wrote:

Hey David, might I need to track the evolution of that size-change utility to ZFS? Could I have a contact at Sun that would be able to give me more information on that? Being able to resize LUNs dynamically is a reality here; I currently do it with UFS after an EMC Clariion LUN Migration to a larger LUN. That is our current show-stopper to using ZFS.

Thanks,
Yannick
Re: [zfs-discuss] Re: ZFS - SAN and Raid
On 6/20/07, Torrey McMahon [EMAIL PROTECTED] wrote:

Also, how does replication at the ZFS level use more storage - I'm assuming raw block - than at the array level?

Just to add to the previous comments: in the case where you have a SAN array providing storage to a host for use with ZFS, the SAN storage really needs to be redundant in the array AND the zpools need to be redundant pools.

The reason the SAN storage should be redundant is that SAN arrays are designed to serve logical units. The logical units are usually allocated from a raid set, storage pool, or aggregate of some kind. The array-side pool/aggregate may include 10 300GB disks and may have 100+ LUNs allocated from it, for example. If redundancy is not used in the array-side pool/aggregate, then one disk failure will kill 100+ LUNs at once.

On 6/20/07, Torrey McMahon [EMAIL PROTECTED] wrote:

James C. McPherson wrote:

Roshan Perera wrote:

But Roshan, if your pool is not replicated from ZFS' point of view, then all the multipathing and raid controller backup in the world will not make a difference.

James, I agree from the ZFS point of view. However, from the EMC or the customer point of view, they want to do the replication at the EMC level and not from ZFS. By replicating at the ZFS level they will lose some storage and it is doubling the replication. It's just that the customer is used to working with Veritas and UFS and they don't want to change their habits. I just have to convince the customer to use ZFS replication.

Hi Roshan, that's a great shame, because if they actually want to make use of the features of ZFS such as replication, then they need to be serious about configuring their storage to play in the ZFS world, and that means replication that ZFS knows about.

Also, how does replication at the ZFS level use more storage - I'm assuming raw block - than at the array level?
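For example, a couple of ways to give ZFS its own redundancy on top of array LUNs (device names are placeholders; ideally each LUN in a mirror or raidz set comes from a different array-side raid group):

    # mirror two LUNs drawn from different array-side raid groups
    zpool create tank mirror c4t<lunA>d0 c4t<lunB>d0

    # or build a raidz set across several LUNs
    zpool create tank raidz c4t<lun1>d0 c4t<lun2>d0 c4t<lun3>d0
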
Re: [zfs-discuss] ZFS - SAN and Raid
Roshan,

Could you provide more detail, please? The host and ZFS should be unaware of any EMC array-side replication, so this sounds more like an EMC misconfiguration than a ZFS problem.

Did you look in the messages file to see if anything happened to the devices that were in your zpools? If so, then that wouldn't be a ZFS error. If your EMC devices fall offline because of something happening on the array or fabric, then ZFS is not to blame. The same thing would have happened with any other filesystem built on those devices.

What kind of pools were in use: raidz, mirror or simple stripe?

Regards,
Vic

On 6/19/07, Roshan Perera [EMAIL PROTECTED] wrote:

Hi All,

We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring being done at the EMC level and not in ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN-attached devices and understand the ZFS behaviour. Can someone help me with the following questions: Is this the way ZFS will work in the future? Is there going to be any compromise to allow SAN RAID and let ZFS do the rest? If so, when, and if possible, details of it?

Many thanks,
Rgds
Roshan

Does ZFS work with SAN-attached devices? Yes, ZFS works with either direct-attached devices or SAN-attached devices. However, if your storage pool contains no mirror or RAID-Z top-level devices, ZFS can only report checksum errors but cannot correct them. If your storage pool consists of mirror or RAID-Z devices built using storage from SAN-attached devices, ZFS can report and correct checksum errors.

This says that if we are not using ZFS raid or mirror, then the expected event would be for ZFS to report but not fix the error. In our case the system kernel panicked, which is something different. Is the FAQ wrong, or is there a bug in ZFS?
[zfs-discuss] Re: ZFS - SAN and Raid
Roshan,

As far as I know, there is no problem at all with using SAN storage with ZFS, and it does look like you were having an underlying problem with either PowerPath or the array.

The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self-healing feature to work. Also, there is no fsck-like tool for ZFS, so it is a good idea to make sure self-healing can work.

I think first I would track down the cause of the messages just prior to the ZFS write error, because even with replicated pools, if several devices error at once then the pool could be lost.

Regards,
Vic

On 6/19/07, Roshan Perera [EMAIL PROTECTED] wrote:

Victor,

Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail. After some digging around I found that the messages file has thrown out some PowerPath errors to one of the devices that may have caused the problem. The errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware raid?

Thanks,
Roshan

Jun 4 16:30:09 su621dwdb ltid[23093]: [ID 815759 daemon.error] Cannot start rdevmi process for remote shared drive operations on host su621dh01, cannot connect to vmd
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to
Jun 4 16:30:12 su621dwdb last message repeated 1 time
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0fee to
Jun 4 16:30:12 su621dwdb unix: [ID 836849 kern.notice]
Jun 4 16:30:12 su621dwdb ^Mpanic[cpu550]/thread=2a101dd9cc0:
Jun 4 16:30:12 su621dwdb unix: [ID 809409 kern.notice] ZFS: I/O failure (write on unknown off 0: zio 600574e7500 [L0 unallocated] 4000L/400P DVA[0]=5:55c00:400 DVA[1]=6:2b800:400 fletcher4 lzjb BE contiguous birth=107027 fill=0 cksum=673200f97f:34804a0e20dc:102879bdcf1d13:3ce1b8dac7357de): error 5
Jun 4 16:30:12 su621dwdb unix: [ID 10 kern.notice]
Jun 4 16:30:12 su621dwdb genunix: [ID 723222 kern.notice] 02a101dd9740 zfs:zio_done+284 (600574e7500, 0, a8, 708fdca0, 0, 6000f26cdc0)
Jun 4 16:30:12 su621dwdb genunix: [ID 179002 kern.notice] %l0-3: 060015beaf00 0 000708fdc00 0005 0005

We have the same problem, and I have just moved back to UFS because of this issue. According to the engineer at Sun that I spoke with, he implied that there is an RFE out internally that is to address this problem.

The issue is this: when configuring a zpool with 1 vdev in it, and ZFS times out a write operation to the pool/filesystem for whatever reason, possibly just a hold-back or a retryable error, the ZFS module will cause a system panic because it thinks there are no other mirrors in the pool to write to, and so it forces a kernel panic. The way around this is to configure the zpools with mirrors, which negates the use of a hardware raid array and sends twice the amount of data down to the RAID cache than is actually required (because of the mirroring at the ZFS layer). In our case it was a little old Sun StorEdge 3511 FC SATA Array, but the principle applies to any RAID array that is not configured as a JBOD.

Victor Engle wrote:

Roshan,

Could you provide more detail, please? The host and ZFS should be unaware of any EMC array-side replication, so this sounds more like an EMC misconfiguration than a ZFS problem.

Did you look in the messages file to see if anything happened to the devices that were in your zpools? If so, then that wouldn't be a ZFS error. If your EMC devices fall offline because of something happening on the array or fabric, then ZFS is not to blame. The same thing would have happened with any other filesystem built on those devices.

What kind of pools were in use: raidz, mirror or simple stripe?

Regards,
Vic

On 6/19/07, Roshan Perera [EMAIL PROTECTED] wrote:

Hi All,

We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring being done at the EMC level and not in ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN-attached devices and understand the ZFS behaviour. Can someone help me with the following questions: Is this the way ZFS will work in the future? Is there going to be any compromise to allow SAN RAID and let ZFS do the rest? If so, when, and if possible, details of it?

Many thanks,
Rgds
Roshan

Does ZFS work with SAN-attached devices? Yes, ZFS works with either direct-attached devices or SAN-attached devices. However, if your storage pool contains no mirror or RAID-Z top-level devices, ZFS can only report checksum errors but cannot correct them. If your storage pool
Re: [zfs-discuss] Re: ZFS - SAN and Raid
The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self-healing feature to work. Also, there is no fsck-like tool for ZFS, so it is a good idea to make sure self-healing can work.

NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults.

I understand that the on-disk state is always consistent, but the self-healing feature can correct blocks that have bad checksums if ZFS is able to retrieve the block from a good replica. So even though the filesystem is consistent, the data can be corrupt in non-redundant pools.

I am unsure of what happens with a non-redundant pool when a block has a bad checksum, and perhaps you could clear that up. Does this cause a problem for the pool, or is it limited to the file or files affected by the bad block while the pool otherwise stays online and healthy?

Thanks,
Vic
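One related knob worth noting here (hedged, since whether it is present depends on the release in use): the zfs copies property stores extra copies of each data block, which gives self-healing a second copy to read from even in a single-LUN pool, although it does not protect against losing the whole LUN:

    zfs set copies=2 tank/data    # keep two copies of every data block written from now on
    zfs get copies tank/data
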
Re: [zfs-discuss] Virtual IP Integration
Well, I suppose complexity is relative. Still, to use Sun Cluster at all I have to install the cluster framework on each node, correct? And even before that I have to install an interconnect with 2 switches, unless I direct-connect a simple 2-node cluster.

My thinking was that ZFS seems to try to bundle all storage-related tasks into one simple interface, including making vfstab and dfstab entries unnecessary and considered legacy with respect to ZFS. If I am using ZFS only to serve storage via IP, then the only component I'm forced to manage outside of ZFS is the IP, and if that's really all I want, then it does seem like overkill to install, configure and administer the Sun Cluster framework on even 2 nodes.

I'm not really thinking about an application where I really need Sun Cluster-like availability. Just the convenience factor of being able to export a pool to another system if I need to do maintenance or patching or whatever, without having to go configure the other system. As it is now, the only thing I might need to do is bring up the virtual IP on the system I import the pool to. A good example would be maybe a system where I keep jumpstart images. I really don't need HA for it, but simple administration is always a plus. It's an easy enough task to script, I suppose, but it occurred to me that it would be very convenient to have this task built in to ZFS.

Regards,
Vic

On 6/15/07, Richard Elling [EMAIL PROTECTED] wrote:

Vic Engle wrote:

Has there been any discussion here about the idea of integrating a virtual IP into ZFS? It makes sense to me because of the integration of NFS and iSCSI with the sharenfs and shareiscsi properties. Since these are both dependent on an IP, it would be pretty cool if there was also a virtual IP that would automatically move with the pool. Maybe something like:

    zfs set ip.nge0=x.x.x.x mypool

Or, since we may have different interfaces on the nodes where we want to move the zpool:

    zfs set ip.server1.nge0=x.x.x.x mypool
    zfs set ip.server2.bge0=x.x.x.x mypool

I know this could be handled with Sun Cluster, but if I am only building a simple storage appliance to serve NFS and iSCSI along with CIFS via Samba, then I don't want or need the overhead and complexity of Sun Cluster.

Overhead? The complexity of a simple HA storage service is quite small. The complexity arises when you have multiple dependencies, where various applications depend on local storage and other applications (think SMF, but spread across multiple OSes). For a simple relationship such as storage--ZFS--share, there isn't much complexity. Reinventing the infrastructure needed to manage access in the face of failures is a distinctly non-trivial task. You can even begin with a single-node cluster, though a virtual IP on a single-node cluster isn't very interesting.

-- richard
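For what it's worth, a rough sketch of the manual version of that move (interface, address and pool name are placeholders; Solaris ifconfig syntax shown):

    # on the node giving up the pool
    zpool export tank

    # on the node taking it over
    zpool import tank
    ifconfig nge0 addif 192.168.10.25/24 up    # bring up the virtual IP as a logical interface
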