Leal,
{ Question inline }
> Hello all,
> Today I ran into some problems with AVS... I was running some tests
> to see how many updates were needed to exceed the cluster's default
> timeout, and this is what happened:
> 1) I ran a clnode evacuate on node 2 (the secondary), and one task of
> my agent is to reverse-sync the disks to the primary node. After some
> time (5 minutes, I guess), the primary node crashed. The dsstat
> output on the secondary node showed:
>
> name             t  s    pct  role  ckps  dkps  tps  svt
> dev/rdsk/c2d0s0  S  RS  0.00  net      -     0    0    0
> dev/rdsk/c2d0s1               bmp       0     0    0    0
> dev/rdsk/c3d0s0  S  R   0.00  net      -     0    0    0
> dev/rdsk/c3d0s1               bmp       0     0    0    0
>
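For reference, a reverse update sync like the one your agent performs
is normally started and waited on with something like the following
(a sketch; the set/group name B2007 is taken from your log further
down, and your agent's exact flags may differ):

    # On the node holding the good data, push the changed blocks back
    # to the primary ("reverse update sync"):
    sndradm -g B2007 -n -u -r

    # Block until the sync completes; the 'RS' state in the dsstat
    # output above means a reverse sync is still in progress:
    sndradm -g B2007 -n -w

    # Progress can be watched with:
    dsstat -m sndr 5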
> 2) After the primary node came back up, I tried to import the ZFS
> pool, because I had "half" of the mirror OK. It worked fine, so I ran
> a scrub to fix the other half, and then exported the pool again:
>
>   pool: MYPOOL
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with
>         'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub in progress, 68,87% done, 0h0m to go
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         B2007       ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2d0s0  ONLINE       0     0   181
>             c3d0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
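(For the archive, the recovery sequence described above amounts to
roughly the following; the pool name is taken from the status output:)

    zpool import -f MYPOOL   # -f may be needed since the pool was not
                             # cleanly exported before the crash
    zpool scrub MYPOOL       # repair the stale half of the mirror
    zpool status MYPOOL      # check until the scrub reports completion
    zpool export MYPOOL      # release the pool again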
> Until now everything seems to be fine, and I think it was the Sun
> Cluster timeout that made the primary node crash. The last line of my
> agent's log is:
>
> nsd_prenet_start-b2007-nonshareddevice-rs[29476] 20080218-155524:
> running: /usr/sbin/sndradm -C local -g B2007 -n -w
What is the purpose and context of the line above? An sndrboot -s
(suspend), later followed by an sndrboot -r (resume), works
independently of whether replication is active or in logging mode.
- Jim
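To illustrate the distinction (a sketch; check sndradm(1M) and
sndrboot(1M) on your release for the exact arguments, which I am
quoting from memory):

    # -w blocks until the set reaches the requested state; while a
    # full or update sync is still running this can easily outlast a
    # cluster method timeout:
    sndradm -C local -g B2007 -n -w

    # suspend/resume of the set configuration, by contrast, does not
    # depend on the replication state:
    sndrboot -C local -s    # suspend the configured sets
    sndrboot -C local -r    # resume them later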
>
>
> So the agent was waiting on the resync task, and it took more than
> five minutes (I'm copying ha-cluster to confirm that).
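If the prenet-start method really is being killed by its timeout, one
workaround is to raise that timeout on the resource. A sketch, where
the resource name is guessed from your log line and the property name
assumes a standard Sun Cluster 3.2 resource type:

    # show the current timeout for the PRENET_START method
    clresource show -p Prenet_start_timeout b2007-nonshareddevice-rs

    # raise it well above the worst-case resync time (seconds)
    clresource set -p Prenet_start_timeout=1800 b2007-nonshareddevice-rs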
>
> But the worst part is that I then tried to import the ZFS pool on
> the secondary node, and the secondary server crashed too. :( That
> filesystem was supposed to be 100% OK!
> As I was able to mount the filesystem on the primary node, I could
> start the "resource group" there, and now the services are available
> again.
> I have a core file on the primary node ("/core"), but none on the
> secondary node. The questions:
> 1) How can I confirm that the primary node crashed because of the
> timeout limit (can the core file help in that situation)?
> 2) How can I find out why the secondary node crashed while trying to
> import the zpool?
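On the crash-analysis side: a node panic normally leaves a kernel
crash dump under /var/crash/<hostname> via savecore(1M), which is far
more useful than a userland /core. A minimal first look (a sketch,
assuming savecore is enabled and dump 0 is the one from this crash):

    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0
    > ::status    # panic string; a cluster-induced failfast panic
    >             # will usually say "Failfast" here
    > ::msgbuf    # console messages leading up to the panic
    > $C          # stack backtrace of the panicking thread

The same commands on the secondary's dump, if it actually panicked
and savecore ran, should show whether the zpool import itself brought
the node down.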
>
> Thanks a lot!
>
> Leal.
>
>
Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss