Leal,

{ Question inline }


> Hello all,
> Today I ran into some problems with AVS. I was running some tests to
> see how many "updates" were needed to exceed the cluster's default
> timeout, and this is what happened:
> 1) I ran a clnode evacuate on node 2 (the secondary), and one task of
> my agent is to reverse-sync the disks to the primary node. After some
> time (5 minutes, I guess), the primary node crashed. "dsstat" on the
> secondary node showed:
>
> name              t  s    pct role   ckps   dkps   tps  svt
> dev/rdsk/c2d0s0   S RS   0.00  net      -      0     0    0
> dev/rdsk/c2d0s1                bmp      0      0     0    0
> dev/rdsk/c3d0s0   S  R   0.00  net      -      0     0    0
> dev/rdsk/c3d0s1                bmp      0      0     0    0
>
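For context, a reverse update sync of the kind the agent performs is
normally kicked off with something along these lines. Treat this as a
sketch only, not the agent's actual code; the group name is taken from
the log further down:

   # reverse update sync: copy only the blocks that changed on the
   # secondary back to the primary, without prompting
   /usr/sbin/sndradm -C local -g B2007 -n -u -r

   # watch the SNDR sets and sync progress, refreshing every 5 seconds
   /usr/sbin/dsstat -m sndr 5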
> 2) After the primary node came back up, I tried to import the ZFS
> pool, since I had "half" of the mirror OK. That worked fine, so I ran
> a scrub to fix the other half and then exported it again:
>
>   pool: MYPOOL
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with
>         'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub in progress, 68,87% done, 0h0m to go
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        B2007       ONLINE       0     0     0
>          mirror    ONLINE       0     0     0
>            c2d0s0  ONLINE       0     0   181
>            c3d0s0  ONLINE       0     0     0
>
> errors: No known data errors
>
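For the record, the recovery sequence described above amounts to
something like this (a sketch; pool name as in the status output):

   # import the pool on the surviving node, then repair the stale half
   zpool import MYPOOL
   zpool scrub MYPOOL

   # once the scrub completes, clear the error counters and export
   zpool status MYPOOL
   zpool clear MYPOOL
   zpool export MYPOOL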
> Up to this point everything seemed fine, because I think the Sun
> Cluster timeout is what made the primary node crash. The last line of
> my agent's log is:
>
>  nsd_prenet_start-b2007-nonshareddevice-rs[29476] 20080218-155524:  
> running: /usr/sbin/sndradm -C local -g B2007 -n -w

What is the purpose and context of the line above, given that an
sndrboot -s (suspend), later followed by an sndrboot -r (resume),
works independently of whether replication is active or in logging
mode?
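To make the contrast concrete, the two operations look roughly like
this (the hostname argument is illustrative; see sndrboot(1M) for the
exact usage):

   # -w blocks until any sync in progress on the set completes or
   # aborts; it is only meaningful on the primary side
   /usr/sbin/sndradm -C local -g B2007 -n -w

   # suspend, then later resume, the SNDR sets tied to a given host;
   # this works whether the sets are replicating or in logging mode
   /usr/sbin/sndrboot -s <hostname>
   /usr/sbin/sndrboot -r <hostname>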

- Jim

>
>
> So the agent was waiting on the resync task, and it took more than
> five minutes (I'm copying ha-cluster to confirm that).
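If the START method really is blocking on the resync, one workaround
is to raise the resource's start timeout well above the expected
resync time. A sketch with the Sun Cluster 3.2 CLI, with the resource
name inferred from the agent's log line above:

   # show the current START method timeout for the agent's resource
   clresource show -p Start_timeout b2007-nonshareddevice-rs

   # raise it to, say, 30 minutes
   clresource set -p Start_timeout=1800 b2007-nonshareddevice-rs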
>
> But the worst part was that I tried to import the ZFS pool on the
> secondary node, and the secondary server crashed too. :( That
> filesystem was supposed to be 100% OK!
> Since I was able to mount the filesystem on the primary node, I could
> start the resource group there, and now I have the services available
> again.
> I have a core file on the primary node (/core), but none on the
> secondary node. My questions:
> 1) How can I confirm that the primary node crashed because of the
> timeout limit (can the core file help in that situation)?
> 2) How can I find out why the secondary node crashed while trying to
> import the zpool?
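On question 1: if the node actually panicked, the kernel crash dump
(not /core, which would be a userland core from a process) is the
thing to look at; by default it is saved under /var/crash/<hostname>.
A rough triage session looks like this:

   # confirm where dumps are written and that savecore is enabled
   dumpadm

   # load the dump, then use ::status for the panic string,
   # ::panicinfo for the panicking CPU/thread, ::stack for the stack
   cd /var/crash/`uname -n`
   mdb unix.0 vmcore.0

If /core is a userland core from the agent process itself, "pstack
/core" will show where it died.

On question 2: importing the pool on the secondary while SNDR is still
replicating onto those same devices is dangerous; the usual procedure
is to drop the set into logging mode first, so the secondary volumes
stop changing underneath ZFS, e.g.:

   sndradm -C local -g B2007 -n -l

The secondary's panic should likewise land in its /var/crash if
savecore is configured there.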
>
> Thanks a lot!
>
> Leal.

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
