Hi Ethan,

Great job putting this pool back together...

I would agree with the disk-by-disk replacement approach using the zpool
replace command. You can read about this command here:

http://docs.sun.com/app/docs/doc/817-2271/gazgd?a=view
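
For example, the basic form is:

# zpool replace <pool> <old-device> [<new-device>]

so, purely as an illustration using the device names from your status
output, something like:

# zpool replace q /export/home/ethan/qdsk/c9t1d0p0 c9t1d0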

Having a recent full backup of your data before making any more changes
is always recommended.

You might be able to figure out whether c9t1d0p0 is a failing disk by
checking the fmdump -eV output, but with the devices having been moved
around, it might be difficult to isolate the right one in gobs of output.
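
For example (the grep pattern here is only an illustration -- the device
paths in the ereports might not contain the c9t1d0 name directly):

# fmdump -e | more
# fmdump -eV | grep -i c9t1d0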

Also, if you are using whole disks, then use the c9t*d* designations.
The s* designations are unnecessary for whole disks and building pools
with p* devices isn't recommended.
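
For example, a whole-disk pool would be built with just the cXtXdX names
(hypothetical pool name, devices only as an illustration):

# zpool create tank raidz c9t0d0 c9t1d0 c9t2d0 c9t4d0 c9t5d0

ZFS then labels each disk itself (with an EFI label) and uses the whole
disk, rather than an fdisk partition (p*) or a slice (s*) you set up by
hand.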

Thanks,

Cindy


On 02/18/10 10:42, Ethan wrote:
On Thu, Feb 18, 2010 at 04:14, Daniel Carosone <d...@geek.com.au> wrote:

    On Wed, Feb 17, 2010 at 11:37:54PM -0500, Ethan wrote:
     > > It seems to me that you could also use the approach of 'zpool
     > > replace' for
     > That is true. It seems like it would then have to rebuild from
     > parity for every drive, though, which I think would take rather a
     > long while, wouldn't it?

    No longer than copying - plus, it will only resilver active data, so
    unless the pool is close to full it could save some time.  Certainly
    it will save some hassle and the risk of error from plugging and
    swapping drives between machines more times.  As a further benefit,
    all this work will
    count towards a qualification cycle for the current hardware setup.

    I would recommend using replace, one drive at a time. Since you still
    have the original drives to fall back on, you can do this now (before
    making more changes to the pool with new data) without being overly
    worried about a second failure killing your raidz1 pool.  Normally,
    when doing replacements like this on a singly-redundant pool, it's a
    good idea to run a scrub after each replace, making sure everything
    you just wrote is valid before relying on it to resilver the next
    disk.
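
    For example, after each resilver completes (assuming the pool is
    still imported as 'q'):

    # zpool scrub q
    # zpool status -v q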

    If you're keen on copying, I'd suggest doing it over the network; that
    way your write target is a system that knows the target partitioning
    and there's no (mis)calculation of offsets.
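
    For example, one rough way to do that (host name and target device
    here are purely illustrative):

    # dd if=/dev/rdsk/c9t4d0p0 bs=1024k | ssh targethost 'dd of=/dev/rdsk/c0t0d0s0 bs=1024k'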

    --
    Dan.



These are good points - it sounds like replacing one at a time is the way to go. Thanks for pointing out these benefits. I do notice that right now it imports just fine using the p0 devices with just `zpool import q`, without having to use import -d with the directory of symlinks to the p0 devices. I guess this has to do with having repaired the labels and such? Or whatever got repaired when it successfully imported and scrubbed.
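
For reference, the -d form I had been using was along these lines, with
the symlink directory being the one that shows up in the device paths
below:

# zpool import -d /export/home/ethan/qdsk q

versus now just plain `zpool import q`.
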
After the scrub finished, this is the state of my pool:


# zpool status
  pool: q
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed after 7h18m with 0 errors on Thu Feb 18 06:25:44 2010
config:

        NAME                                  STATE     READ WRITE CKSUM
        q                                     DEGRADED     0     0     0
          raidz1                              DEGRADED     0     0     0
            /export/home/ethan/qdsk/c9t4d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t5d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t2d0p0  ONLINE       0     0     0
            /export/home/ethan/qdsk/c9t1d0p0  DEGRADED     4     0    60  too many errors
            /export/home/ethan/qdsk/c9t0d0p0  ONLINE       0     0     0

errors: No known data errors


I have no idea what happened to the one disk, but "No known data errors" is what makes me happy. I'm not sure whether I should be concerned about the physical disk itself, or just assume that some data got screwed up in all this mess. I guess I'll see how the disk behaves during the replace operations (restoring to it and then restoring from it four times seems like a pretty good test of it); if it continues to error, I'll replace the physical drive and, if necessary, restore from the original truecrypt volumes.
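
If the errors turn out to be transient, I gather they can be reset per
the action text above with something like:

# zpool clear q /export/home/ethan/qdsk/c9t1d0p0

and then watched with `zpool status -v q` during the replaces.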

So, current plan:
- export the pool.
- format c9t1d0 to have one slice being the entire disk.
- import. should be degraded, missing c9t1d0p0.
- replace the missing c9t1d0p0 with c9t1d0 (should this be c9t1d0s0? my understanding is that ZFS will treat the two about the same, since it adds a partition table to a raw device if that's what it's given and ends up using s0 anyway). Rough command sketch below.
- wait for resilver.
- repeat with the other four disks.
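
For the first disk, I'm picturing something like this (the exact format
step and the s0-vs-whole-disk choice are the parts I'm least sure of):

# zpool export q
# format                 (select c9t1d0, relabel it with one slice covering the disk)
# zpool import q         (pool should come up degraded, missing c9t1d0p0)
# zpool replace q /export/home/ethan/qdsk/c9t1d0p0 c9t1d0
# zpool status q         (wait for the resilver to finish)
# zpool scrub q          (per Dan's suggestion, before moving to the next disk)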

Sound good?

-Ethan


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss