Roman,
A weird issue:
1. AVS works for connections on a local switch, via a local FreeBSD
router connected to the switch:
host1 -> switch -> FreeBSD router -> switch -> host2
2. When I try to emulate replication over a long-distance remote
connection, with the FreeBSD router on the remote side, AVS fails
with the error:
sndradm: warning: SNDR: Recovery bitmaps not allocated
First of all, what version of AVS / OpenSolaris are you running? The
reason I ask is that this error message being returned from "sndradm"
was a problem that has been partially resolved, both for AVS on
Solaris 10 and for AVS bundled with OpenSolaris.
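For reference, a couple of ways to check (the /etc/release and modinfo
commands are standard; the SUNWrdcu package name is from memory, so
treat it as an assumption about your build):

    cat /etc/release                      # OpenSolaris / Solaris release
    modinfo | grep -i rdc                 # loaded SNDR (rdc) kernel modules
    pkginfo -l SUNWrdcu | grep VERSION    # AVS Remote Mirror package version,
                                          # if that is the package name on
                                          # your build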
The specific issue at hand is that during the first stages of an
"sndradm -u ..." update command, the SNDR secondary node is asked to
send its entire bitmap to the SNDR primary node. The operation is done
via a Solaris RPC call, an operation which has an associated timeout
value. If the amount of time it takes to send this data over the
network from the secondary node to the primary node exceeds the RPC
timeout value, the operation fails with "Recovery bitmaps not
allocated".
The (partial) bug fix was to not send this bitmap data as a single RPC
call, but to send pieces of the bitmap in multiple RPC calls. Although
sending the bitmap in pieces reduces the likelihood of an RPC timeout,
it does not prevent all timeouts, especially if network bandwidth is
low, or link latency is high.
Since the time to send the bitmap is a function of its size, the link
bandwidth, and the link latency, your attempt to emulate replication
over a long-distance remote connection suggests to me that the RPC
transfers are still timing out.
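As a rough back-of-the-envelope illustration (all of the numbers here
are made up for the sake of the example; the actual bitmap size depends
on your volume sizes, and the per-RPC chunk size varies by release):

    1 MB of bitmap data over a 1 Mbit/s emulated WAN link:
        ~8,000,000 bits / 1,000,000 bits/s  =  ~8 s of raw transfer time
    plus a few dozen RPC round trips at ~200 ms each = several more seconds

so even a modest bitmap can get close to, or past, a 16 second timeout
on a slow or high-latency link.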
The default value for the RPC timeout is 16 seconds (0x10 hex). To
change this, adjust the value of rdc_rpc_tmout in the file /etc/system.
set rdc:rdc_rpc_tmout=0x20
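Keep in mind that /etc/system is only read at boot, so the change takes
effect at the next reboot. If you want to try a larger value in the
running kernel first, something along these lines should work (a sketch
only: it assumes rdc_rpc_tmout is a 32-bit integer in the loaded rdc
module, and I have not verified that the running code picks the new
value up immediately):

    echo 'rdc_rpc_tmout/W 0x20' | mdb -kw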
You can make sure the change has been made by running:
echo "::rdc" | mdb -k | grep RPC
Be forewarned, there is an SNDR runtime consequence to increasing this
value: for other network or system issues that cause the SNDR
secondary node to stop responding in a timely manner, a larger timeout
value means it will take SNDR longer to detect those types of failures.
Full replication nevertheless works in this case, so I assume there
are no problems with the network.
Full replication, versus update replication, does not require the
exchange of bitmaps from secondary to primary. Full replication makes
the determination that all primary volume blocks differ from the
secondary volume blocks, and sets the bitmap volume to all 1's. Update
replication ORs together the changed bitmaps from both the SNDR
primary and secondary nodes, which results in replicating only those
blocks that have changed.
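For reference, the two operations correspond to different sndradm
options (a quick summary only, not the full command syntax, which
depends on how your set is configured):

    sndradm -nm <set>   full synchronization: treats every block as
                        changed, so no secondary bitmap exchange is needed
    sndradm -nu <set>   update synchronization: ORs the primary and
                        secondary bitmaps and copies only changed blocks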
I tried to trace the start of sndradm with truss, because there is no
obvious reason to me why it fails. The network is OK, name resolution
is OK, and RPC responds.
More than 60% of SNDR is kernel mode code, so the following user mode
trace of sndradm will not show where the real error lies. Also, the
above error should not happen as a result of "sndradm -nE", but
instead "sndradm -nu". For the update (-u) option, I would expect an
error back from an ioctl(,0x0504400) call.
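If you want to watch the kernel side while reproducing this, a DTrace
sketch along the following lines may help (nothing AVS-specific is
assumed here beyond the SNDR kernel module being named rdc):

    # count entries into rdc kernel functions while sndradm runs
    dtrace -n 'fbt:rdc::entry { @calls[probefunc] = count(); }'

    # watch ioctl return values and errno from sndradm itself
    dtrace -n 'syscall::ioctl:return /execname == "sndradm"/
        { printf("ret=%d errno=%d\n", arg0, errno); }'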
- Jim
What I can see is only the following difference in the traces of the
sndradm -nE command between local and remote replication:
getpid()                                      = 3245 [3244]   | getpid()                         = 3315 [3314]
fcntl(5, F_SETLKW, 0x08046608)                = 0             | fcntl(5, F_SETLKW, 0x08046608)   (sleeping...)
lseek(5, 0, SEEK_SET)                         = 0             | Stopped by signal #24, SIGTSTP, in fcntl()
read(5, " I G A M", 4)                        = 4             | Received signal #25, SIGCONT, in fcntl() [default]
lseek(5, 0, SEEK_SET)                         = 0             |     siginfo: SIGCONT pid=2426 uid=0
read(5, " I G A M\f\0\0\0 f #EF I".., 148)    = 148           | fcntl(5, F_SETLKW, 0x08046608)   (sleeping...)
read(5, " C : s c m . t h r e a d".., 16384)  = 16384         <
read(5, " 1 2 8 6 4 - -".., 36)               = 36            <
lseek(5, 2097116, SEEK_CUR)                   = 2113684       <
read(5, " 1 2 8 6 4 - -".., 36)               = 36            <
lseek(5, 2097116, SEEK_CUR)                   = 4210836       <
read(5, "01\0\0\01D\0\0\001\0\0\0".., 524288) = 524288        <
Does fcntl(5, F_SETLKW, 0x08046608) fail? Or is this something else?
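(If it helps, I can also check which file descriptor 5 points at, and
which other process holds the lock on it, with the standard tools; the
pid below would be whichever sndradm instance is stuck:

    pfiles <pid-of-sndradm>      # shows the path behind file descriptor 5
    fuser -f <that-path>         # lists other processes with that file open
)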
Thanks,
Roman
--
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss