[zfs-discuss] Interesting Pool Import Failure

2008-09-16 Thread Solaris
Hello...  Since there has been much discussion about zpool import failures
resulting in loss of an entire pool, I thought I would illustrate a scenario
I just went through to recover a faulted pool that wouldn't import under
Solaris 10 U5.  While this is a simple scenario, and the data was not
terribly important, I think the exercise should at least give some peace of
mind to those who are not always satisfied with the idea that something that
remains a problem in Solaris 10 Ux has been resolved in OpenSolaris.  The
end result is that I'm still running Solaris 10 U5 with a pool full of
uncorrupted data.

I have been evaluating a Sun Fire X4500.  It came with Solaris 10 U5
pre-installed, and went through the config steps leaving most items for
defaults.

I created one zpool with 8 raidz vdevs and two spares, each vdev having
target n from each controller, with the exception of the two vdevs which
hold the bootable components and the two spares.  During the evaluation it
was discovered that in order to share iscsi targets with some iscsi
initiators I should switch over to 2008.05.  I thought I would simply be
able to install 2008.05 over top of Sol10U5, import/upgrade the pool, and
move on.  Indeed I probably can do that, but it didn't quite go as planned...

So under U5, the boot devices (mirrored w/ SVM) were c0t1d0 and c5t0d0.  I
exported the zpool and rebooted.  Through the ILOM's Java console I mounted
the 2008.05 iso and booted from it.  When the OS came online, I quickly
started the installer and selected the device labeled c0t1d0 as the
installation target (this was rather stupid on my part).  After rebooting
when the install completed, I quickly found myself loading Solaris 10 U5
again?  Huh?!?  Unfortunately, the disk naming was not consistent:
2008.05's c0t1d0 was U5's c4t0d0, a member of the first raidz vdev.  Solaris
10 U5 booted up with lots of service faults about not being able to import
the pool.  (See the zpool import output below.)  Zpool import also showed an
additional pool called rpool (from the 2008.05 install), but I could not act
on that pool as it is a more advanced version of ZFS than U5 understands.

So, after discovering this, I booted back to the 2008.05 iso and destroyed
the rpool on c4t0d0, thinking all might be ok afterwards with just some
resilvering action being required.  Booting back into U5, zpool import still
reports the following...

[EMAIL PROTECTED]:/]# zpool import
  pool: tank
id: 12280374646305972114
 state: FAULTED
action: The pool cannot be imported due to damaged devices or data.
config:

        tank        UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  corrupted data
            c0t0d0  ONLINE
            c1t0d0  ONLINE
            c4t0d0  ONLINE
            c6t0d0  ONLINE
            c7t0d0  ONLINE
. . . .snip. . . .

Booting again into the iso for 2008.05, zpool import showed the pool in a
degraded state.  I was able to replace the failed disk with a spare, then
revert back to U5 and import the pool successfully, replace the spare with
the original drive, and then reconfigure the spare.  So now I'm back to
running Solaris 10 U5, stable, with my 44-disk pool full of iscsi volumes.
Point being that even if you can't run OpenSolaris due to support issues,
you may still be able to use OpenSolaris to help resolve ZFS issues that you
might run into in Solaris 10.

Apologies for not having any command history under 2008.05 to show off.  I
was using the ILOM console to mount the 2008.05 iso and unfortunately
there's no copy/paste between the ILOM console and my workstation.  In all
it was quite a simple fix; it just took me a while to wrap my brain around
how to go about it.
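
That said, the rough shape of what I ran is below, reconstructed from
memory.  Pool and disk names are from my box (I've used U5's device naming
throughout, even though 2008.05 showed the overwritten disk under a
different cXtYdZ name), and the spare name is just a placeholder, so treat
this as a sketch rather than a transcript:

  # From the 2008.05 live environment: import the pool, swap the
  # overwritten disk for one of the hot spares, and hand it back.
  zpool import tank
  zpool replace tank c4t0d0 c4t4d0    # c4t4d0 = a spare (placeholder name)
  zpool export tank

  # Back under Solaris 10 U5: import, put the original disk back in
  # place of the spare, then reconfigure the spare.
  zpool import tank
  zpool replace tank c4t4d0 c4t0d0
  zpool remove tank c4t4d0            # how I reconfigured the spare;
  zpool add tank spare c4t4d0         # exact steps may vary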

-- 

Ignorance: America's most abundant and costly commodity.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Do you grok it?

2008-09-16 Thread Wade . Stuart


[EMAIL PROTECTED] wrote on 09/15/2008 11:32:15 PM:

 Brandon High wrote:
  On Fri, Sep 12, 2008 at 11:49 AM, Dale Ghent [EMAIL PROTECTED]
wrote:
 
  Did I detect a (well-done) metaphor for shared ZFS?
 
 
  Probably not. It looks like a deduplication / MAID solution.
 

 Yeah, I think they blew it on the colors of the gumballs in the machine
:-)
 But other than that, it looks like a very interesting machine.
  -- richard

I am looking forward to clause 3.1 (source code availability) of the CDDL,
as from their marketing descriptions they appear to have implemented some
form of in-ZFS hash de-dupe.

-Wade

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting Pool Import Failure

2008-09-16 Thread Miles Nordin
 s == Solaris  [EMAIL PROTECTED] writes:

 s Point being that even if you can't run OpenSolaris due to
 s support issues, you may still be able to use OpenSolaris to
 s help resolve ZFS issues that you might run into in Solaris 10.

glad ZFS is improving, but this sentence is a fantastic bit of
Newthink.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] A few questions

2008-09-16 Thread gm_sjo
2008/9/15 gm_sjo:
 2008/9/15 Ben Rockwood:
 On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
 I've come to regret this.  I recommend keeping pools reasonably sized
 and to keep stripes thinner than this.

 Could you clarify why you came to regret it? I was intending to create
 a single pool for 8 1TB disks.

Sorry, just bouncing this back for Ben in case he missed it.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] iscsi target problems on snv_97

2008-09-16 Thread Moore, Joe
I've recently upgraded my x4500 to Nevada build 97, and am having problems with 
the iscsi target.

Background: this box is used to serve NFS underlying a VMware ESX environment 
(zfs filesystem-type datasets) and to present iSCSI targets (zfs zvol datasets) 
to a Windows host and as zoneroots for Solaris 10 hosts.  For optimal 
random-read performance, I've configured a single zfs pool of mirrored VDEVs of 
all 44 disks (+2 boot disks, +2 spares = 48).

Before the upgrade, the box was flaky under load: all I/Os to the ZFS pool 
would stop occasionally.

Since the upgrade, that hasn't happened, and the NFS clients are quite happy.  
The iSCSI initiators are not.

The Windows initiator is running the Microsoft iSCSI initiator v2.0.6 on 
Windows 2003 SP2 x64 Enterprise Edition.  When the system reboots, it is not 
able to connect to its iscsi targets.  No devices are found until I restart the 
iscsitgt process on the x4500, at which point the initiator will reconnect and 
find everything.  I notice that on the x4500, it maintains an active TCP 
connection (according to netstat -an | grep 3260) to the Windows box through 
the reboot and for a long time afterwards.  The initiator starts a second 
connection, but it seems that the target doesn't let go of the old one.  Or 
something.  At this point, every time I reboot the Windows system I have to 
run `pkill iscsitgtd`.
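
For reference, the workaround sequence looks roughly like this on my box
(the port and daemon name are what I see on my build; adjust as needed):

  # on the x4500: look for the stale target-side connection that
  # survived the Windows reboot
  netstat -an | grep 3260

  # kill the target daemon (SMF should restart it), after which the
  # initiator can log back in and see its devices
  pkill iscsitgtd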

The Solaris system is running S10 Update 4.  Every once in a while (twice 
today, and not correlated with the pkill's above) the system reports that all 
of the iscsi disks are unavailable.  Nothing I've tried short of a reboot of 
the whole host brings them back.  All of the zones on the system remount their 
zoneroots read-only (and give I/O errors when read or zlogin'd to)

There are a set of TCP connections from the zonehost to the x4500 that remain 
even through disabling the iscsi_initiator service.  There's no process holding 
them as far as pfiles can tell.

Does this sound familiar to anyone?  Any suggestions on what I can do to 
troubleshoot further?  I have a kernel dump from the zonehost and a snoop 
capture of the wire for the Windows host (but it's big).

I'll be opening a bug too.

Thanks,
--Joe
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] A few questions

2008-09-16 Thread Peter Tribble
On Tue, Sep 16, 2008 at 10:03 PM, Ben Rockwood [EMAIL PROTECTED] wrote:
 gm_sjo wrote:
 2008/9/15 gm_sjo:

 2008/9/15 Ben Rockwood:

 On Thumpers I've created single pools of 44 disks, in 11 disk RAIDZ2's.
 I've come to regret this.  I recommend keeping pools reasonably sized
 and to keep stripes thinner than this.

 Could you clarify why you came to regret it? I was intending to create
 a single pool for 8 1TB disks.


 Sorry, just bouncing the back for Ben incase he missed it.


 No, I didn't miss it, just was hoping I could get some benchmarking in
 to justify my points.


 You want to keep stripes wide to reduce wasted disk space, but you
 also want to keep them narrow to reduce the elements involved in parity
 calculation.  In light home use I don't see a problem with an 8 disk
 RAIDZ/RAIDZ2.  If you're serving in a multi-user environment your primary
 concern is to reduce the movement of the disk heads, and thus narrower
 stripes become advantageous.

I'm not sure that the width of the stripe is directly a problem. But
what is true
is that the random read performance of raidz1/2 is basically that of a single
drive, so having more vdevs is better. Given a fixed number of drives, more
vdevs implies narrower stripes, but that's a side-effect rather than a cause.
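
As a rough back-of-the-envelope illustration (the ~100 random-read IOPS per
7,200 rpm drive figure below is an assumption, not a measured number):

  44 disks as  4 x 11-wide raidz2  ->  ~4 vdevs x ~100 IOPS =  ~400 IOPS
  44 disks as 22 x  2-way mirrors  -> ~22 vdevs x ~100 IOPS = ~2200 IOPS
                                      (or better, since both sides of a
                                      mirror can service reads)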

For what it's worth, we put all the disks on our thumpers into a single pool -
mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS and
would happily go much bigger.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZPOOL Import Problem

2008-09-16 Thread Miles Nordin
 jd == Jim Dunham [EMAIL PROTECTED] writes:

jd If at the time the SNDR replica is deleted the set was
jd actively replicating, along with ZFS actively writing to the
jd ZFS storage pool, I/O consistency will be lost, leaving ZFS
jd storage pool in an indeterministic state on the remote node.

jd To address this issue, prior to deleting the replicas, the
jd replica should be placed into logging mode first.

What if you stop the replication by breaking the network connection
between primary and replica?  consistent or inconsistent?

it sounds fishy, like ``we're always-consistent-on-disk with ZFS, but
please use 'zpool offline' to avoid disastrous pool corruption.''

jd ndr_ii. This is an automatic snapshot taken before
jd resynchronization starts,

yeah that sounds fine, possibly better than DRBD in one way because it
might allow the resync to go faster.  

From the PDF's it sounds like async replication isn't done the same
way as the resync, it's done safely, and that it's even possible for
async replication to accumulate hours of backlog in a ``disk queue''
without losing write ordering so long as you use the ``blocking mode''
variant of async.

ii might also be good for debugging a corrupt ZFS, so you can tinker
with it but still roll back to the original corrupt copy.  I'll read
about it---I'm guessing I will need to prepare ahead of time if I want
ii available in the toolbox after a disaster.

jd AVS has the concept of I/O consistency groups, where all disks
jd of a multi-volume filesystem (ZFS, QFS) or database (Oracle,
jd Sybase) are kept write-order consistent when using either sync
jd or async replication.

Awesome, so long as people know to use it.  so I guess that's the
answer for the OP: use consistency groups!
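
For the archives, my understanding of the syntax (paraphrased from the AVS
docs as I remember them; the hosts, devices, and group name here are all
invented) is roughly:

  # enable two SNDR sets, both members of the same I/O consistency
  # group so writes stay ordered across the pool's devices
  sndradm -n -e primhost /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
                sechost  /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
                ip async g zpoolgrp
  sndradm -n -e primhost /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 \
                sechost  /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 \
                ip async g zpoolgrp

  # before deleting a set (or doing anything else disruptive), drop the
  # whole group into logging mode so the remote copy stays write-order
  # consistent
  sndradm -n -g zpoolgrp -l

corrections welcome if I've mangled that.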

The one thing I worry about is that, before, AVS was used between RAID and
filesystem, which is impossible now because that inter-layer area no
longer exists.  If you put the individual device members of a
redundant zpool vdev into an AVS consistency group, what will AVS do
when one of the devices fails?

Does it continue replicating the working devices and ignore the failed
one?  This would sacrifice redundancy at the DR site.  UFS-AVS-RAID
would not do that in the same situation.

Or hide the failed device from ZFS and slow things down by sending all
reads/writes of the failed device to the remote mirror?  This would
slow down the primary site.  UFS-AVS-RAID would not do that in the
same situation.

The latter ZFS-AVS behavior might be rescueable, if ZFS had the
statistical read-preference feature.  but writes would still be
massively slowed with this scenario, while in UFS-AVS-RAID they would
not be.  To get back the level of control one used to have for writes,
you'd need a different zpool-level way to achieve the intent of the
AVS sync/async option.  Maybe just a slog which is not AVS-replicated
would be enough, modulo other ZFS fixes for hiding slow devices.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] iscsi target problems on snv_97

2008-09-16 Thread tim szeto
Moore, Joe wrote:
 I've recently upgraded my x4500 to Nevada build 97, and am having problems 
 with the iscsi target.

 Background: this box is used to serve NFS underlying a VMware ESX environment 
 (zfs filesystem-type datasets) and presents iSCSI targets (zfs zvol datasets) 
 for a Windows host and to act as zoneroots for Solaris 10 hosts.  For optimal 
 random-read performance, I've configured a single zfs pool of mirrored VDEVs 
 of all 44 disks (+2 boot disks, +2 spares = 48)

 Before the upgrade, the box was flaky under load: all I/Os to the ZFS pool 
 would stop occasionally.

 Since the upgrade, that hasn't happened, and the NFS clients are quite happy. 
  The iSCSI initiators are not.

 The windows initiator is running the Microsoft iSCSI initiator v2.0.6 on 
 Windows 2003 SP2 x64 Enterprise Edition.  When the system reboots, it is not 
 able to connect to its iscsi targets.  No devices are found until I restart 
 the iscsitgt process on the x4500, at which point the initiator will 
 reconnect and find everything.  I notice that on the x4500, it maintains an 
 active TCP connection (according to netstat -an | grep 3260) to the Windows 
 box through the reboot and for a long time afterwards.  The initiator starts 
 a second connection, but it seems that the target doesn't let go of the old 
 one.  Or something.  At this point, every time I reboot the Windows system I 
 have to `pkill iscsitgtd`
   
 The Solaris system is running S10 Update 4.  Every once in a while (twice 
 today, and not correlated with the pkill's above) the system reports that all 
 of the iscsi disks are unavailable.  Nothing I've tried short of a reboot of 
 the whole host brings them back.  All of the zones on the system remount 
 their zoneroots read-only (and give I/O errors when read or zlogin'd to)

 There are a set of TCP connections from the zonehost to the x4500 that remain 
 even through disabling the iscsi_initiator service.  There's no process 
 holding them as far as pfiles can tell.

 Does this sound familiar to anyone?  Any suggestions on what I can do to 
 troubleshoot further?  I have a kernel dump from the zonehost and a snoop 
 capture of the wire for the Windows host (but it's big).
   
I believe the problem you're seeing might be related to a deadlock 
condition (CR 6745310).  If you run pstack on the iscsi target daemon 
you might find a bunch of zombie threads.  The fix was putback into 
snv_99; give snv_99 a try.
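
Something like the following should show them (assuming the daemon is
named iscsitgtd on your build):

  pstack `pgrep -x iscsitgtd`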

-Tim

 I'll be opening a bug too.

 Thanks,
 --Joe
 ___
 storage-discuss mailing list
 [EMAIL PROTECTED]
 http://mail.opensolaris.org/mailman/listinfo/storage-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] A few questions

2008-09-16 Thread mike
On Tue, Sep 16, 2008 at 2:28 PM, Peter Tribble [EMAIL PROTECTED] wrote:

 For what it's worth, we put all the disks on our thumpers into a single pool -
 mostly it's 5x 8+1 raidz1 vdevs with a hot spare and 2 drives for the OS and
 would happily go much bigger.

so you have 9-drive raidz1 (8 disks usable) + hot spare, or
8-drive raidz1 (7 disks usable) + hot spare?

It sounds like people -can- build larger pools but, due to their
storage needs (performance, availability, etc.), choose NOT to. For home
usage, with maybe 4 clients maximum and where I can deal with downtime
when swapping out a drive, I think I can live with decent (not insane)
performance and try to maximize my space (without making ZFS's
redundancy features useless).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] x4500 vs AVS ?

2008-09-16 Thread Jorgen Lundman

Sorry, I popped up to Hokkaido for a holiday. I want to thank you all 
for the replies.

I mentioned AVS as I thought it to be the only product close to 
enabling us to do a (makeshift) fail-over setup.

We have 5-6 ZFS filesystems, and 5-6 zvols with UFS on them (for quotas). 
Doing zfs send snapshots every minute might perhaps be possible (just not 
very attractive), but if the script dies at any time you need to resend 
the full volumes, which currently takes 5 days (even using nc).
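
(For reference, the kind of loop I mean is roughly the following -- the
dataset names, port, remote host, and snapshot variables are made up, and
each incremental only works while the previous snapshot still exists on
both sides:)

  # on the receiving x4500 (nc option syntax varies between nc builds;
  # the listener has to be re-run for each transfer)
  nc -l -p 9999 | zfs recv -F tank/fs

  # on the sending x4500, once a minute
  zfs snapshot tank/fs@min-$NOW
  zfs send -i tank/fs@min-$PREV tank/fs@min-$NOW | nc remote-x4500 9999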

Since we are forced by vendor to run Sol10, it sounds like AVS is not an 
option for us.

If we were interested in finding a method to replicate data to a 2nd 
x4500, what other options are there for us? We do not need instant 
updates, just someplace to fail over to when the x4500 panics or a HDD 
dies (which equals a panic). It currently takes 2 hours to fsck the UFS 
volumes after a panic (and yes, they are logging; it is actually just 
the one UFS volume that always needs fsck).

The vendor has mentioned Veritas Volume Replicator, but I was under the 
impression that Veritas is a whole different toolset from zfs/zpool.

Lund




Jim Dunham wrote:
 On Sep 11, 2008, at 5:16 PM, A Darren Dunham wrote:
 On Thu, Sep 11, 2008 at 04:28:03PM -0400, Jim Dunham wrote:
 On Sep 11, 2008, at 11:19 AM, A Darren Dunham wrote:

 On Thu, Sep 11, 2008 at 10:33:00AM -0400, Jim Dunham wrote:
 The issue with any form of RAID 1, is that the instant a disk  
 fails
 out of the RAID set, with the next write I/O to the remaining  
 members
 of the RAID set, the failed disk (and its replica) are instantly  
 out
 of sync.
 Does raidz fall into that category?
 Yes. The key reason is that as soon as ZFS (or other mirroring  
 software)
 detects a disk failure in a RAID 1 set, it will stop writing to the
 failed disk, which also means it will also stop writing to the  
 replica of
 the failed disk. From the point of view of the remote node, the  
 replica
 of the failed disk is no longer being updated.

 Now if replication was stopped, or the primary node powered off or
 panicked, during the import of the ZFS storage pool on the secondary
 node, the replica of the failed disk must not be part of the ZFS  
 storage
 pool as its data is stale. This happens automatically, since the ZFS
 metadata on the remaining disks have already given up on this  
 member of
 the RAID set.
 Then I misunderstood what you were talking about.  Why the restriction
 on RAID 1 for your statement?
 
 No restriction. I meant to say, RAID 1 or greater.
 
 Even for a mirror, the data is stale and
 it's removed from the active set.  I thought you were talking about
 block parity run across columns...

 -- 
 Darren
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
 Jim Dunham
 Engineering Manager
 Storage Platform Software Group
 Sun Microsystems, Inc.
 work: 781-442-4042
 cell: 603.724.2972
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 

-- 
Jorgen Lundman   | [EMAIL PROTECTED]
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS system requirements

2008-09-16 Thread Erik Trimble
Just one more thing on this:

Run with a 64-bit processor. Don't even think of using a 32-bit one -
there are known issues with ZFS not quite properly using 32-bit only
structures.  That is, ZFS is really 64-bit clean, but not 32-bit clean.
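
(A quick way to confirm what your kernel is actually running:)

  # prints the kernel's instruction set, e.g. "64-bit amd64 kernel modules"
  isainfo -kv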

grin


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS system requirements

2008-09-16 Thread Cyril Plisko
On Wed, Sep 17, 2008 at 6:06 AM, Erik Trimble [EMAIL PROTECTED] wrote:
 Just one more things on this:

 Run with a 64-bit processor. Don't even think of using a 32-bit one -
 there are known issues with ZFS not quite properly using 32-bit only
 structures.  That is, ZFS is really 64-bit clean, but not 32-bit clean.


Wow!  That's a statement.  Can you provide more info on these 32-bit issues?
I am not aware of any.  In fact, besides being sluggish (presumably due
to the limited address space), I never noticed any issues with ZFS, which I
used on a 32-bit machine for 2 years.

 grin


 --
 Erik Trimble
 Java System Support
 Mailstop:  usca22-123
 Phone:  x17195
 Santa Clara, CA
 Timezone: US/Pacific (GMT-0800)

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Regards,
 Cyril
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss