[zfs-discuss] Interesting Pool Import Failure
Hello... Since there has been much discussion about zpool import failures resulting in loss of an entire pool, I thought I would illustrate a scenario I just went through to recover a faulted pool that wouldn't import under Solaris 10 U5. While this is a simple scenario, and the data was not terribly important, I think the exercise should at least give some peace of mind to those who are not always satisfied with the idea that something that remains a problem in Solaris 10 Ux has been resolved in OpenSolaris. The end result is that I'm still running Solaris 10 U5 with a pool full of uncorrupted data.

I have been evaluating a Sun Fire X4500. It came with Solaris 10 U5 pre-installed, and I went through the config steps leaving most items at their defaults. I created one zpool with 8 raidz vdevs and two spares, each vdev having target n from each controller, with the exception of the two vdevs which hold the bootable components and the two spares.

During the evaluation it was discovered that in order to share iscsi targets with some iscsi initiators I should switch over to 2008.05. I thought I would simply be able to install 2008.05 over top of Sol10 U5, import/upgrade the pool, and move on. Indeed I probably can do that, but it didn't quite go as planned...

Under U5, the boot devices (mirrored w/ SVM) were c0t1d0 and c5t0d0. I exported the zpool and rebooted. Through the ILOM's Java console I mounted the 2008.05 iso and booted from it. When the OS came online, I quickly started the installer and selected the device labeled c0t1d0 as the installation target (this was rather stupid on my part). After rebooting when the install completed, I quickly found myself loading Solaris 10 U5 again? Huh?!? Unfortunately, the disk naming was not consistent... 2008.05's c0t1d0 was U5's c4t0d0, a member of the first raidz vdev. Solaris 10 U5 booted up with lots of service faults about not being able to import the pool.
(See the zpool import output below.) Zpool import showed an additional pool called rpool (from the 2008.05 install), but I could not act on that pool as it is a more advanced version of ZFS than U5 understands. So, after discovering this, I booted back to the 2008.05 iso and destroyed the rpool on c4t0d0, thinking all might be ok afterwards with just some resilvering action being required. Booting back into U5, zpool import still reports the following...

[EMAIL PROTECTED]:/]# zpool import
  pool: tank
    id: 12280374646305972114
 state: FAULTED
action: The pool cannot be imported due to damaged devices or data.
config:

        tank          UNAVAIL  insufficient replicas
          raidz1      UNAVAIL  corrupted data
            c0t0d0    ONLINE
            c1t0d0    ONLINE
            c4t0d0    ONLINE
            c6t0d0    ONLINE
            c7t0d0    ONLINE
        . . . .snip. . . .

Booting again into the iso for 2008.05, zpool import showed the pool in a degraded state. I was able to replace the failed disk with a spare, then revert back to U5 and import the pool successfully, replace the spare with the original drive, and then reconfigure the spare. So now I'm back to running Solaris 10 U5, stable, with my 44-disk pool full of iscsi volumes.

Point being that even if you can't run OpenSolaris due to support issues, you may still be able to use OpenSolaris to help resolve ZFS issues that you might run into in Solaris 10.

Apologies for not having any command history under 2008.05 to show off. I was using the ILOM console to mount the 2008.05 iso, and unfortunately there's no copy/paste between the ILOM console and my workstation. In all it was quite a simple fix; it just took me a while to wrap my brain around how to go about it.

--
Ignorance: America's most abundant and costly commodity.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
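The whole dance above, compressed into commands. This is a sketch only, reconstructed from the narrative since the original command history was lost to the ILOM console: the spare's device name (c6t7d0) is made up for illustration, and DRYRUN=1 just prints each step instead of running it.

```shell
#!/bin/sh
# Reconstruction of the recovery sequence described above.
# DRYRUN=1 (the default) prints the commands instead of executing them.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# --- booted from the 2008.05 live CD ---
# 1. Destroy the stray rpool that the installer put on the raidz member
#    (U5's c4t0d0); U5 can't touch it because its ZFS version is newer.
run zpool destroy -f rpool
# 2. Under 2008.05 the pool shows DEGRADED rather than FAULTED, so swap a
#    hot spare in for the overwritten disk, then export cleanly.
run zpool replace tank c4t0d0 c6t7d0   # c6t7d0: hypothetical spare name
run zpool export tank

# --- booted back into Solaris 10 U5 ---
# 3. Import, put the (now blank) original disk back in place of the spare,
#    and re-add the spare to the pool.
run zpool import tank
run zpool replace tank c6t7d0 c4t0d0
run zpool add tank spare c6t7d0
```

The dry-run wrapper is just so the sequence can be read (and sanity-checked) without a Thumper at hand; drop DRYRUN=0 in only after checking the device names against your own `zpool status`.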
Re: [zfs-discuss] Do you grok it?
[EMAIL PROTECTED] wrote on 09/15/2008 11:32:15 PM:

> Brandon High wrote:
>> On Fri, Sep 12, 2008 at 11:49 AM, Dale Ghent [EMAIL PROTECTED] wrote:
>>> Did I detect a (well-done) metaphor for shared ZFS?
>> Probably not. It looks like a deduplication / MAID solution.
> Yeah, I think they blew it on the colors of the gumballs in the machine :-)
> But other than that, it looks like a very interesting machine.
> -- richard

I am looking forward to clause 3.1 (Source Code Availability) of the CDDL, as from their marketing descriptions they appear to have implemented some form of in-ZFS hash de-dupe.

-Wade
Re: [zfs-discuss] Interesting Pool Import Failure
s == Solaris [EMAIL PROTECTED] writes:

 s  Point being that even if you can't run OpenSolaris due to
 s  support issues, you may still be able to use OpenSolaris to
 s  help resolve ZFS issues that you might run into in Solaris 10.

Glad ZFS is improving, but this sentence is a fantastic bit of Newthink.
Re: [zfs-discuss] [storage-discuss] A few questions
2008/9/15 gm_sjo:
> 2008/9/15 Ben Rockwood:
>> On Thumpers I've created single pools of 44 disks, in 11-disk RAIDZ2's.
>> I've come to regret this. I recommend keeping pools reasonably sized and
>> keeping stripes thinner than this.
> Could you clarify why you came to regret it? I was intending to create a
> single pool for 8 1TB disks.

Sorry, just bouncing this back for Ben in case he missed it.
[zfs-discuss] iscsi target problems on snv_97
I've recently upgraded my x4500 to Nevada build 97, and am having problems with the iscsi target.

Background: this box is used to serve NFS underlying a VMware ESX environment (zfs filesystem-type datasets) and presents iSCSI targets (zfs zvol datasets) for a Windows host and to act as zoneroots for Solaris 10 hosts. For optimal random-read performance, I've configured a single zfs pool of mirrored vdevs of all 44 disks (+2 boot disks, +2 spares = 48).

Before the upgrade, the box was flaky under load: all I/Os to the ZFS pool would stop occasionally. Since the upgrade, that hasn't happened, and the NFS clients are quite happy. The iSCSI initiators are not.

The Windows initiator is running the Microsoft iSCSI initiator v2.0.6 on Windows 2003 SP2 x64 Enterprise Edition. When the system reboots, it is not able to connect to its iscsi targets. No devices are found until I restart the iscsitgt process on the x4500, at which point the initiator will reconnect and find everything. I notice that the x4500 maintains an active TCP connection (according to netstat -an | grep 3260) to the Windows box through the reboot and for a long time afterwards. The initiator starts a second connection, but it seems that the target doesn't let go of the old one. Or something. At this point, every time I reboot the Windows system I have to `pkill iscsitgtd`.

The Solaris system is running S10 Update 4. Every once in a while (twice today, and not correlated with the pkills above) the system reports that all of the iscsi disks are unavailable. Nothing I've tried short of a reboot of the whole host brings them back. All of the zones on the system remount their zoneroots read-only (and give I/O errors when read or zlogin'd to). There are a set of TCP connections from the zonehost to the x4500 that remain even through disabling the iscsi_initiator service. There's no process holding them as far as pfiles can tell.

Does this sound familiar to anyone?
Any suggestions on what I can do to troubleshoot further? I have a kernel dump from the zonehost and a snoop capture of the wire for the Windows host (but it's big). I'll be opening a bug too.

Thanks,
--Joe
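Joe's workaround (bounce the target daemon when a stale connection lingers) could be scripted along these lines. A sketch only, not tested on an x4500: the SMF FMRI for the iscsi target (svc:/system/iscsitgt:default) is from memory and worth verifying with `svcs`, and the script degrades to a harmless check on a non-Solaris box.

```shell
#!/bin/sh
# Look for leftover ESTABLISHED connections on the iSCSI port; if any are
# found on a box with SMF, restart the target service instead of pkill'ing
# the daemon, so SMF keeps tracking it.
PORT=3260
if command -v netstat >/dev/null 2>&1; then
  # Solaris netstat shows addr.port, Linux addr:port; match either.
  stale=$(netstat -an 2>/dev/null | grep ESTABLISHED | grep -c "[.:]$PORT ")
else
  stale=0
fi
echo "established connections involving port $PORT: $stale"
if [ "$stale" -gt 0 ] && command -v svcadm >/dev/null 2>&1; then
  svcadm restart svc:/system/iscsitgt:default
fi
```

Run from cron or by hand after rebooting the Windows initiator; it only restarts the target when there is actually something stuck.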
Re: [zfs-discuss] [storage-discuss] A few questions
On Tue, Sep 16, 2008 at 10:03 PM, Ben Rockwood [EMAIL PROTECTED] wrote:
> gm_sjo wrote:
>> 2008/9/15 gm_sjo:
>>> 2008/9/15 Ben Rockwood:
>>>> On Thumpers I've created single pools of 44 disks, in 11-disk RAIDZ2's.
>>>> I've come to regret this. I recommend keeping pools reasonably sized
>>>> and keeping stripes thinner than this.
>>> Could you clarify why you came to regret it? I was intending to create
>>> a single pool for 8 1TB disks.
>> Sorry, just bouncing this back for Ben in case he missed it.
> No, I didn't miss it; I was just hoping I could get some benchmarking in
> to justify my points. You want to keep stripes wide to reduce wasted disk
> space, but you also want to keep them narrow to reduce the elements
> involved in the parity calculation. In light home use I don't see a
> problem with an 8-disk RAIDZ/RAIDZ2. If you're serving in a multi-user
> environment your primary concern is to reduce the movement of the disk
> heads, and thus narrower stripes become advantageous.

I'm not sure that the width of the stripe is directly a problem. But what is true is that the random read performance of raidz1/2 is basically that of a single drive, so having more vdevs is better. Given a fixed number of drives, more vdevs implies narrower stripes, but that's a side-effect rather than a cause.

For what it's worth, we put all the disks on our thumpers into a single pool - mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS - and would happily go much bigger.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
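Peter's point about vdev count versus stripe width can be made with simple arithmetic. A sketch comparing the two 44-disk Thumper layouts mentioned in this thread; the rule of thumb that a raidz vdev delivers roughly one disk's worth of random-read IOPS is Peter's, not a benchmark.

```shell
#!/bin/sh
# Layout A: Peter's layout - 5 raidz1 vdevs of 9 disks (8 data + 1 parity),
# plus a hot spare and 2 OS disks.
vdevs_a=5; width_a=9; parity_a=1
usable_a=$(( vdevs_a * (width_a - parity_a) ))

# Layout B: Ben's regretted layout - 4 raidz2 vdevs of 11 disks each.
vdevs_b=4; width_b=11; parity_b=2
usable_b=$(( vdevs_b * (width_b - parity_b) ))

echo "layout A: $usable_a data disks, ~${vdevs_a}x single-disk random-read IOPS"
echo "layout B: $usable_b data disks, ~${vdevs_b}x single-disk random-read IOPS"
```

Layout A gives 40 data disks to layout B's 36 and one extra vdev's worth of random-read concurrency, which is why "more, narrower vdevs" wins for multi-user workloads: the vdev count, not the stripe width itself, is the multiplier.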
Re: [zfs-discuss] ZPOOL Import Problem
jd == Jim Dunham [EMAIL PROTECTED] writes:

 jd  If at the time the SNDR replica is deleted the set was
 jd  actively replicating, along with ZFS actively writing to the
 jd  ZFS storage pool, I/O consistency will be lost, leaving the ZFS
 jd  storage pool in an indeterministic state on the remote node.
 jd  To address this issue, prior to deleting the replicas, the
 jd  replica should be placed into logging mode first.

What if you stop the replication by breaking the network connection between primary and replica? Consistent or inconsistent? It sounds fishy, like ``we're always consistent on disk with ZFS, but please use 'zpool offline' to avoid disastrous pool corruption.''

 jd  ndr_ii. This is an automatic snapshot taken before
 jd  resynchronization starts,

Yeah, that sounds fine, possibly better than DRBD in one way because it might allow the resync to go faster. From the PDFs it sounds like async replication isn't done the same way as the resync - it's done safely - and that it's even possible for async replication to accumulate hours of backlog in a ``disk queue'' without losing write ordering, so long as you use the ``blocking mode'' variant of async.

ii might also be good for debugging a corrupt ZFS, so you can tinker with it but still roll back to the original corrupt copy. I'll read about it - I'm guessing I will need to prepare ahead of time if I want ii available in the toolbox after a disaster.

 jd  AVS has the concept of I/O consistency groups, where all disks
 jd  of a multi-volume filesystem (ZFS, QFS) or database (Oracle,
 jd  Sybase) are kept write-order consistent when using either sync
 jd  or async replication.

Awesome, so long as people know to use it. So I guess that's the answer for the OP: use consistency groups! The one thing I worry about is that before, AVS was used between RAID and filesystem, which is impossible now because that inter-layer area no longer exists.
If you put the individual device members of a redundant zpool vdev into an AVS consistency group, what will AVS do when one of the devices fails? Does it continue replicating the working devices and ignore the failed one? This would sacrifice redundancy at the DR site; UFS-AVS-RAID would not do that in the same situation. Or does it hide the failed device from ZFS and slow things down by sending all reads/writes of the failed device to the remote mirror? This would slow down the primary site; UFS-AVS-RAID would not do that in the same situation either.

The latter ZFS-AVS behavior might be rescuable if ZFS had the statistical read-preference feature, but writes would still be massively slowed in this scenario, while in UFS-AVS-RAID they would not be. To get back the level of control one used to have for writes, you'd need a different zpool-level way to achieve the intent of the AVS sync/async option. Maybe a slog which is not AVS-replicated would be enough, modulo other ZFS fixes for hiding slow devices.
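For the OP's situation, Jim's "logging mode before deleting the replica" advice combined with consistency groups looks roughly like the following. A dry-run sketch only: the group name is hypothetical, and the sndradm option spelling is from memory and should be double-checked against sndradm(1M) before use.

```shell
#!/bin/sh
# Sketch: safely tear down SNDR replication for a multi-volume ZFS pool by
# forcing the whole I/O consistency group into logging mode first.
# DRYRUN=1 (the default) prints the commands instead of running them.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

GROUP=tankgrp   # hypothetical consistency-group name covering every vdev member

# 1. Put every set in the group into logging mode. Because all the pool's
#    devices share one group, write ordering stays consistent across them,
#    leaving the secondary crash-consistent rather than indeterminate.
run sndradm -n -g "$GROUP" -l
# 2. Only then delete the replicas.
run sndradm -n -g "$GROUP" -d
```

The point of the group is step 1: logging each set individually, one at a time, is exactly the window where the remote copy loses write-order consistency.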
Re: [zfs-discuss] [storage-discuss] iscsi target problems on snv_97
Moore, Joe wrote:
> I've recently upgraded my x4500 to Nevada build 97, and am having
> problems with the iscsi target.
> [... full description of the NFS/iSCSI setup and the stuck target
> connections snipped; see the original post above ...]
> Does this sound familiar to anyone?
> Any suggestions on what I can do to troubleshoot further? I have a kernel
> dump from the zonehost and a snoop capture of the wire for the Windows
> host (but it's big). I'll be opening a bug too. Thanks, --Joe

I believe the problem you're seeing might be related to a deadlock condition (CR 6745310). If you run pstack on the iscsi target daemon you might find a bunch of zombie threads. The fix was put back in snv_99 - give snv_99 a try.

-Tim
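Tim's pstack suggestion, as a small check script. A sketch: grepping for cond_wait is only a heuristic for "parked" threads, not a definitive zombie-thread test, and since the daemon name is spelled both ways earlier in the thread (iscsitgt and iscsitgtd), both are tried.

```shell
#!/bin/sh
# Dump the iscsi target daemon's user-land thread stacks and count how many
# threads are sitting in condition waits. Harmless no-op on a box where the
# daemon (or pstack) isn't present.
pid=$(pgrep -x iscsitgtd 2>/dev/null || pgrep -x iscsitgt 2>/dev/null)
if [ -n "$pid" ] && command -v pstack >/dev/null 2>&1; then
  msg="threads parked in cond_wait: $(pstack $pid | grep -c cond_wait)"
else
  msg="no iscsi target daemon running here; nothing to inspect"
fi
echo "$msg"
```

If the count keeps growing across initiator reboots while connections pile up on port 3260, that pattern would be consistent with the deadlock Tim describes.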
Re: [zfs-discuss] [storage-discuss] A few questions
On Tue, Sep 16, 2008 at 2:28 PM, Peter Tribble [EMAIL PROTECTED] wrote:
> For what it's worth, we put all the disks on our thumpers into a single
> pool - mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for
> the OS - and would happily go much bigger.

So do you have 9-drive raidz1 vdevs (8 disks usable) plus a hot spare, or 8-drive raidz1 vdevs (7 disks usable) plus a hot spare?

It sounds like people -can- build larger pools but, due to their storage needs (performance, availability, etc.), choose not to. For home usage with maybe 4 clients maximum, where I can deal with downtime when swapping out a drive, I think I can live with decent performance (not insane) and try to maximize my space (without making ZFS's redundancy features useless).
Re: [zfs-discuss] x4500 vs AVS ?
Sorry, I popped up to Hokkaido for a holiday. I want to thank you all for the replies.

I mentioned AVS as I thought it to be the only product close to enabling us to do a (makeshift) fail-over setup. We have 5-6 ZFS filesystems, and 5-6 zvols with UFS (for quotas). To do zfs send snapshots every minute might perhaps be possible (just not very attractive), but if the script dies at any time, you need to resend the full volumes; this currently takes 5 days (even using nc).

Since we are forced by our vendor to run Sol10, it sounds like AVS is not an option for us. If we were interested in finding a method to replicate data to a 2nd x4500, what other options are there for us? We do not need instant updates, just someplace to fail over to when the x4500 panics, or a HDD dies (which equals panic). It currently takes 2 hours to fsck the UFS volumes after a panic (and yes, they are logging; it is actually just the one UFS volume that always needs fsck).

Vendor has mentioned Veritas Volume Replicator, but I was under the impression that Veritas is a whole different toolset to zfs/zpool.

Lund

Jim Dunham wrote:

  On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:

  On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:

  On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:

  On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:

  The issue with any form of RAID 1 is that the instant a disk fails out of the RAID set, with the next write I/O to the remaining members of the RAID set, the failed disk (and its replica) are instantly out of sync.

  Does raidz fall into that category?

  Yes. The key reason is that as soon as ZFS (or other mirroring software) detects a disk failure in a RAID 1 set, it will stop writing to the failed disk, which also means it will stop writing to the replica of the failed disk. From the point of view of the remote node, the replica of the failed disk is no longer being updated.
  Now if replication was stopped, or the primary node powered off or panicked, during the import of the ZFS storage pool on the secondary node, the replica of the failed disk must not be part of the ZFS storage pool, as its data is stale. This happens automatically, since the ZFS metadata on the remaining disks has already given up on this member of the RAID set.

  Then I misunderstood what you were talking about. Why the restriction to RAID 1 for your statement?

  No restriction. I meant to say RAID 1 or greater. Even for a mirror, the data is stale and it's removed from the active set.

  I thought you were talking about block parity run across columns...
  -- Darren

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.
work: 781-442-4042 cell: 603.724.2972

--
Jorgen Lundman | [EMAIL PROTECTED]
Unix Administrator | +81 (0)3-5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3-3375-1767 (home)
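On the "if the script dies you must resend the full volumes" worry: with incremental zfs send, only the delta since the last snapshot common to both sides is reshipped, so a crashed script costs one missed interval rather than a 5-day full send, as long as that common snapshot survives on both hosts. A dry-run sketch with made-up host and dataset names, and without the error handling a real cron job would need.

```shell
#!/bin/sh
# One replication pass: snapshot, send the increment since the previous
# baseline, and roll the baseline forward. DRYRUN=1 (default) only prints
# the pipelines.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "would run: $*"; else eval "$*"; fi; }

DS=tank/mail          # hypothetical dataset
REMOTE=x4500-standby  # hypothetical fail-over host
PREV=repl-prev        # baseline snapshot assumed to exist on both sides
CUR=repl-$(date +%s)

run "zfs snapshot $DS@$CUR"
# Ship only the delta since the last snapshot both sides hold:
run "zfs send -i $DS@$PREV $DS@$CUR | ssh $REMOTE zfs receive -F $DS"
# Roll the baseline forward only after the receive succeeds; if the script
# dies before this point, $PREV is still valid and the next pass just
# resends a slightly larger increment.
run "zfs destroy $DS@$PREV && zfs rename $DS@$CUR $DS@$PREV"
```

The first run still needs one full `zfs send` to seed the standby, but after that the per-pass cost is proportional to the minute's churn, not the pool size.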
Re: [zfs-discuss] ZFS system requirements
Just one more thing on this: run with a 64-bit processor. Don't even think of using a 32-bit one - there are known issues with ZFS not quite properly using 32-bit-only structures. That is, ZFS is really 64-bit clean, but not 32-bit clean. grin

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] ZFS system requirements
On Wed, Sep 17, 2008 at 6:06 AM, Erik Trimble [EMAIL PROTECTED] wrote:
> Just one more thing on this: run with a 64-bit processor. Don't even think
> of using a 32-bit one - there are known issues with ZFS not quite properly
> using 32-bit-only structures. That is, ZFS is really 64-bit clean, but not
> 32-bit clean. grin

Wow! That's quite a statement. Can you provide more info on these 32-bit issues? I am not aware of any. In fact, besides being sluggish (presumably due to the limited address space), I never noticed any issues with ZFS, which I used on a 32-bit machine for 2 years.

--
Regards,
Cyril