Hello all,
 Today I ran into some problems with AVS... I was running some tests to see how 
many "updates" were necessary to exceed the cluster's default timeout, and this 
is what happened:
 1) I did a clnode evacuate on node 2 (the secondary); one of my agent's tasks 
is to reverse-sync the disks to the primary node. After some time (5 minutes, I 
guess), the primary node crashed. The "dsstat" on the secondary node showed:
 
name              t  s    pct role   ckps   dkps   tps  svt
dev/rdsk/c2d0s0   S RS   0.00  net      -      0     0    0
dev/rdsk/c2d0s1                bmp      0      0     0    0
dev/rdsk/c3d0s0   S  R   0.00  net      -      0     0    0
dev/rdsk/c3d0s1                bmp      0      0     0    0
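
 If it helps, this is how I was checking the SNDR side of it; a rough sketch, 
assuming the group name B2007 from my agent's log below, and that I remember 
the options right:

  # show state and sync percentage for every set in the group
  sndradm -P -g B2007
  # or watch just the SNDR kstats, refreshing every 5 seconds
  dsstat -m sndr 5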

 2) After the primary node came back up, I tried to import the ZFS pool, 
because I still had "half" of the mirror OK. That worked fine, so I ran a scrub 
to fix the other half and exported the pool again (the rough command sequence 
is sketched after the status output):
 
  pool: MYPOOL
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 68,87% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        B2007       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d0s0  ONLINE       0     0   181
            c3d0s0  ONLINE       0     0     0

errors: No known data errors
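
 For the record, the sequence was roughly this (pool name as in the status 
header; I am not 100% sure the -f was needed, and the zpool clear is what the 
action text above suggests):

  zpool import -f MYPOOL
  zpool scrub MYPOOL
  zpool status MYPOOL       # repeat until the scrub completes
  zpool clear MYPOOL        # clear the logged checksum errors
  zpool export MYPOOL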

 Up to this point everything seemed to make sense, because I think the Sun 
Cluster timeout is what made the primary node crash. The last line of my 
agent's log is:

  nsd_prenet_start-b2007-nonshareddevice-rs[29476] 20080218-155524: running: 
/usr/sbin/sndradm -C local -g B2007 -n -w

 So the agent was blocked in that "sndradm ... -w", waiting for the resync 
task to finish, and it took more than five minutes (I'm copying ha-cluster to 
confirm that).
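
 If the timeout theory is right, the knob should be the timeout of the 
Prenet_start method on the resource; a rough sketch (the resource name is my 
guess from the log line above, and the property name is from Sun Cluster 3.2 
as I remember it):

  # how long the RGM allows the PRENET_START method to run
  clresource show -p Prenet_start_timeout b2007-nonshareddevice-rs
  # raise it well above the worst-case reverse sync time, e.g. 30 minutes
  clresource set -p Prenet_start_timeout=1800 b2007-nonshareddevice-rs

 I believe the RGM can end up rebooting the node when a method hangs past its 
timeout (depending on Failover_mode), which would explain the "crash", but I 
would like confirmation on that.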

 But the worst part was that I then tried to import the ZFS pool on the 
secondary node, and the secondary server crashed too. :( That filesystem was 
supposed to be 100% OK!
 Since I was able to mount the filesystem on the primary node, I could start 
the "resource group" there, and now the services are available again.
 I have a core file on the primary node ("/core"), but none on the secondary 
node. My questions:
 1) How can I confirm that the primary node crashed because of the timeout 
limit (can the core file help in that situation)?
 2) How can I find out why the secondary node crashed while trying to import 
the zpool? (What I was planning to try with mdb is sketched below.)
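
 For (1), this is what I was going to try, assuming the node actually panicked 
and savecore wrote a dump to /var/crash (if /core is just a process core, mdb 
reads that too):

  # kernel crash dump, if savecore captured one
  cd /var/crash/`uname -n`
  mdb unix.0 vmcore.0
  > ::panicinfo     # panic string; should show if the cluster failfast fired
  > ::stack         # stack of the panicking thread

  # the process core I do have at /core
  mdb /core
  > ::status        # which process dumped, and on what signal
  > $C              # its stack at the time of the dump

 The same on the secondary for (2), if I can get a dump out of it (dumpadm 
shows whether savecore is enabled and where the dump device points).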

 Thanks a lot!

 Leal.
 
 