Hello all,
Today I ran into some problems with AVS. I was running some tests to see how
many "updates" were needed to exceed the cluster default timeout, and this is
what happened:
1) I did a clnode evacuate on node 2 (secondary), and one task of my agent is
to reverse sync the disks to the primary node. After some time (five minutes, I
guess), the primary node crashed. "dsstat" on the secondary node showed:
name              t  s    pct  role  ckps  dkps  tps  svt
dev/rdsk/c2d0s0   S  RS  0.00  net      -     0    0    0
dev/rdsk/c2d0s1                bmp      0     0    0    0
dev/rdsk/c3d0s0   S  R   0.00  net      -     0    0    0
dev/rdsk/c3d0s1                bmp      0     0    0    0
2) After the primary node came back up, I tried to import the ZFS pool there,
because I had "half" of the mirror OK. It worked fine, so I ran a scrub to fix
the other half, and exported the pool again:
  pool: MYPOOL
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 68,87% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        B2007       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2d0s0  ONLINE       0     0   181
            c3d0s0  ONLINE       0     0     0

errors: No known data errors
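
To pick out the damaged half of the mirror from output like the above without
reading it by eye, something along these lines might help. It is only a sketch:
the function name zpool_error_devices is made up for illustration, and it
assumes the config rows keep the NAME STATE READ WRITE CKSUM column order shown
above.

```shell
#!/bin/sh
# zpool_error_devices (hypothetical helper): reads `zpool status` text on
# stdin and prints the name of every config row whose READ, WRITE or CKSUM
# counter is non-zero.
zpool_error_devices() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" summary line ends the table.
        /^errors:/                    { in_config = 0 }
        # Inside the table, report rows with any non-zero counter.
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 { print $1 }
    '
}

# Example (not run here):
#   zpool status MYPOOL | zpool_error_devices
```

Against the status output above, this should report only c2d0s0, the side with
the 181 checksum errors.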
So far everything seems fine, because I think it was the Sun Cluster timeout
that made the primary node crash. The last line of my agent's log is:
nsd_prenet_start-b2007-nonshareddevice-rs[29476] 20080218-155524: running:
/usr/sbin/sndradm -C local -g B2007 -n -w
So the agent was waiting for the resync task, and it took more than five
minutes (I'm copying ha-cluster to confirm that).
But the worst part is that I then tried to import the ZFS pool on the secondary
node, and the secondary server crashed too. :( (That filesystem was supposed to
be 100% OK!)
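
One way the agent could avoid sitting in the resync wait past the cluster
timeout is to put its own, shorter limit around the waiting command. The
following is a rough sketch only: run_with_timeout is a hypothetical helper
(not part of AVS or Sun Cluster), the 240-second value in the example is an
arbitrary assumption, and it only bounds how long the agent waits. What should
happen to the resync itself after that is not addressed here.

```shell
#!/bin/sh
# run_with_timeout LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND and sends it SIGTERM if it is still
# running after LIMIT_SECONDS. Returns COMMAND's exit status, or 124 if
# the watchdog had to stop it.
run_with_timeout() {
    limit=$1
    shift
    flag=$(mktemp) || return 125

    "$@" &
    cmd_pid=$!

    # Watchdog: record the timeout in the flag file before killing, so the
    # caller can tell a timeout apart from a normal exit.
    (
        sleep "$limit"
        echo timeout > "$flag"
        kill -TERM "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!

    status=0
    wait "$cmd_pid" || status=$?

    # The command is done; stop the watchdog if it is still sleeping.
    kill -TERM "$watchdog_pid" 2>/dev/null || :
    wait "$watchdog_pid" 2>/dev/null || :

    if [ -s "$flag" ]; then
        status=124
    fi
    rm -f "$flag"
    return "$status"
}

# Example (not run here; 240 is an assumed limit below the five minutes):
#   run_with_timeout 240 /usr/sbin/sndradm -C local -g B2007 -n -w
```

A return value of 124 would then let the agent fail the start method itself,
with a clear log message, before the cluster timeout expires.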
Since I was able to mount the filesystem on the primary node, I could start the
"resource group" there, and the services are now available again.
I have a core file on the primary node ("/core"), but none on the secondary
node. The questions:
1) How can I confirm that the primary node crashed because of the timeout
limit (can the core file help in that situation)?
2) How can I find out why the secondary node crashed while trying to import the
zpool?
Thanks a lot!
Leal.
This message posted from opensolaris.org