Roman,

A weird issue:
1. avs works for connections on a local switch via local freebsd router connected to the switch
 host1 -> switch -> router freebsd -> switch -> host2
2. When trying to emulate replication over a long-distance remote connection, with the freebsd router on the remote side, AVS fails with the error:
sndradm: warning: SNDR: Recovery bitmaps not allocated

First of all, what version of AVS / OpenSolaris are you running? I ask because this error message from "sndradm" comes from a problem that was only partially resolved for AVS on Solaris 10 or the AVS bundled with OpenSolaris.

The specific issue is that during the first stages of an "sndradm -u ..." update command, the SNDR secondary node is asked to send its entire bitmap to the SNDR primary node. This is done via a Solaris RPC call, which has an associated timeout value. If sending this data over the network from the secondary node to the primary node takes longer than the RPC timeout, the operation fails with "Recovery bitmaps not allocated".

The (partial) bug fix was to not send this bitmap data as a single RPC call, but to send pieces of the bitmap in multiple RPC calls. Although sending the bitmap in pieces reduces the likelihood of an RPC timeout, it does not prevent all timeouts, especially if network bandwidth is low, or link latency is high.

Since the time needed to send the bitmap is a function of its size, link bandwidth and link latency, your attempt to emulate replication over a long-distance remote connection indicates to me that the RPC transfers are still timing out.
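To make the size / bandwidth / latency relationship concrete, here is a rough back-of-the-envelope sketch. It is not how SNDR computes anything internally: the function names are made up for illustration, the one-bit-per-32-KB bitmap granularity is an assumption, and the model (one round trip plus serialization time per RPC) ignores protocol overhead and retransmits.

```python
def bitmap_bytes(volume_bytes, bytes_per_bit=32 * 1024):
    """Size of a bitmap holding one bit per `bytes_per_bit` of volume data.

    The 32 KB granularity is an assumption used for illustration.
    """
    bits = -(-volume_bytes // bytes_per_bit)  # ceiling division
    return -(-bits // 8)


def rpc_seconds(payload_bytes, bandwidth_bps, rtt_seconds):
    """Rough time for one RPC: one round trip plus time to push the payload."""
    return rtt_seconds + payload_bytes * 8 / bandwidth_bps


def times_out(payload_bytes, bandwidth_bps, rtt_seconds, timeout_seconds=16):
    """True if the estimated RPC time exceeds the timeout (16 s default)."""
    return rpc_seconds(payload_bytes, bandwidth_bps, rtt_seconds) > timeout_seconds


if __name__ == "__main__":
    payload = bitmap_bytes(2**40)  # bitmap for a 1 TiB volume
    for bps in (1_000_000, 100_000_000):
        print(bps, rpc_seconds(payload, bps, 0.2), times_out(payload, bps, 0.2))
```

Under these assumptions, a bitmap sent in one piece over a slow, high-latency link can easily exceed the timeout, while the same transfer on a local switch finishes well inside it. Sending smaller pieces shrinks `payload_bytes` per call, which is why the partial fix helps but cannot rule out timeouts.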

The default value for the RPC timeout is 16 seconds (0x10 hex). To change this, adjust the value of rdc_rpc_tmout in the file /etc/system.

        set rdc:rdc_rpc_tmout=0x20

Changes to /etc/system only take effect at boot, so after rebooting you can confirm the change by running:

        echo "::rdc" | mdb -k | grep RPC

Be forewarned, increasing this value has an SNDR runtime consequence: when other network or system issues keep the SNDR secondary node from responding in a timely manner, SNDR will take longer to detect those failures with a larger timeout value.
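Since the value in /etc/system is written in hex, it is easy to misread 0x20 as 20 seconds when it is actually 32. The small helper below (a hypothetical convenience function, not part of AVS) builds the line from a value in seconds, assuming the tunable is in seconds as the 16-second default suggests.

```python
def rdc_timeout_line(seconds):
    """Build the /etc/system line for a given RPC timeout in seconds."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("timeout must be a positive whole number of seconds")
    # '#x' formats the integer as lowercase hex with a 0x prefix.
    return f"set rdc:rdc_rpc_tmout={seconds:#x}"


if __name__ == "__main__":
    print(rdc_timeout_line(32))
```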

Full replication nevertheless works in this case, so I assume there are no problems with the network.

Full replication, versus update replication, does not require the exchange of bitmaps from secondary to primary. Full replication assumes that all primary volume blocks differ from the secondary volume blocks, and sets the bitmap volume to all 1's. Update replication ORs together the changed bitmaps from both the SNDR primary and secondary nodes, which results in replicating only those blocks that have changed.
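The difference between the two modes can be sketched with small in-memory bitmaps. This is only an illustration of the logic described above, using Python lists of 0/1 and made-up function names; the real bitmaps live in bitmap volumes and are handled in kernel code.

```python
def full_sync_bitmap(n_blocks):
    """Full replication: treat every block as different, so set all bits."""
    return [1] * n_blocks


def update_sync_bitmap(primary, secondary):
    """Update replication: OR the primary and secondary change bitmaps."""
    if len(primary) != len(secondary):
        raise ValueError("bitmaps must cover the same number of blocks")
    return [p | s for p, s in zip(primary, secondary)]


def blocks_to_copy(bitmap):
    """Indices of the blocks whose bit is set, i.e. the blocks to replicate."""
    return [i for i, bit in enumerate(bitmap) if bit]


if __name__ == "__main__":
    primary = [1, 0, 0, 1]    # blocks changed on the primary
    secondary = [0, 0, 1, 1]  # blocks changed on the secondary
    print(blocks_to_copy(update_sync_bitmap(primary, secondary)))
    print(blocks_to_copy(full_sync_bitmap(4)))
```

Note that only the update path needs the secondary's bitmap as input, which is why full replication can succeed on a link where the bitmap transfer for an update times out.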

I tried to trace the start of sndradm with truss, because I see no obvious reason why it fails. The network is ok, name resolution is ok, and rpc responds.

More than 60% of SNDR is kernel-mode code, so the following user-mode trace of sndradm will not show where the real error lies. Also, the above error should not happen as a result of "sndradm -nE", but rather "sndradm -nu". For the update (-u) option, I would expect an error back from an ioctl(,0x0504400) call.

- Jim

The only difference I can see in the traces of the sndradm -nE command when replicating locally (left) and remotely (right) is the following:

getpid()                                = 3245 [3244]          | getpid()                                = 3315 [3314]
fcntl(5, F_SETLKW, 0x08046608)          = 0                    | fcntl(5, F_SETLKW, 0x08046608)  (sleeping...)
lseek(5, 0, SEEK_SET)                   = 0                    |     Stopped by signal #24, SIGTSTP, in fcntl()
read(5, " I G A M", 4)                  = 4                    |     Received signal #25, SIGCONT, in fcntl() [default]
lseek(5, 0, SEEK_SET)                   = 0                    |       siginfo: SIGCONT pid=2426 uid=0
read(5, " I G A M\f\0\0\0 f #EF I".., 148) = 148               | fcntl(5, F_SETLKW, 0x08046608)  (sleeping...)
read(5, " C : s c m . t h r e a d".., 16384)    = 16384        <
read(5, "   1 2 8   6 4   -     -".., 36)       = 36           <
lseek(5, 2097116, SEEK_CUR)             = 2113684              <
read(5, "   1 2 8   6 4   -     -".., 36)       = 36           <
lseek(5, 2097116, SEEK_CUR)             = 4210836              <
read(5, "01\0\0\01D\0\0\001\0\0\0".., 524288)   = 524288       <


Is it fcntl(5, F_SETLKW, 0x08046608) that fails? Or is this something else?

Thanks,
Roman
--
This message posted from opensolaris.org
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss