Roman,
> A weird issue:
> 1. avs works for connections on a local switch via local freebsd
> router connected to the switch
>  host1 -> switch -> router freebsd -> switch -> host2
> 2.  When trying to emulate replication using far distance remote
> connection with the freebsd router on the remote side then AVS fails
> with the error:
> sndradm: warning: SNDR: Recovery bitmaps not allocated

First of all, what version of AVS / OpenSolaris are you running? The
reason I ask is that this error message, returned from "sndradm", was
a problem partially resolved for AVS on Solaris 10, and for AVS
bundled with OpenSolaris.

# sndradm -v
Remote Mirror version 11.11
# uname -a
SunOS tor.flt 5.11 snv_101b i86pc i386 i86pc Solaris
The specific issue at hand is that during the first stages of an
"sndradm -u ..." update command, the SNDR secondary node is asked to
send its entire bitmap to the SNDR primary node. The operation is done
via a Solaris RPC call, which has an associated timeout value. If the
amount of time it takes to send this data over the network from the
secondary node to the primary node exceeds the RPC timeout value, the
operation fails with "Recovery bitmaps not allocated".

It's strange that SNDR sends the entire bitmap - what if it is for a
big replicated volume, say 1.36GB? That's more than 100,000 blocks for
async replication. There will be constant timeouts on an average 100Mb
link in this case.

SNDR does not replicate the bitmap volume, just the bitmap itself. There is one bit per 32KB of primary volume size, with 8 bits per byte, and 512 bytes per block. The answer for 1.36GB is just 11.04 blocks, or 5.5KB.
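That arithmetic can be sketched as a quick back-of-the-envelope
calculation (the helper name is made up; the granularity figures - one
bit per 32KB, 8 bits per byte, 512 bytes per block - come from the
explanation above):

```python
# Estimate the SNDR bitmap size for a given primary volume size,
# using the granularity described above: one bit per 32KB of
# volume, 8 bits per byte, 512 bytes per disk block.
def sndr_bitmap_size(volume_bytes):
    bits = volume_bytes / (32 * 1024)   # one bit per 32KB chunk
    nbytes = bits / 8                   # eight bits per byte
    blocks = nbytes / 512               # 512-byte disk blocks
    return bits, nbytes, blocks

# A 1.36GB volume needs only a few KB of bitmap:
bits, nbytes, blocks = sndr_bitmap_size(1.36 * 1024**3)
print(f"{bits:.0f} bits = {nbytes:.0f} bytes = {blocks:.2f} blocks")
```

So even for volumes far larger than 1.36GB, the bitmap itself stays
tiny compared to the volume it describes.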
The (partial) bug fix was to not send this bitmap data as a single RPC call, but to send pieces of the bitmap in multiple RPC calls. Although sending the bitmap in pieces reduces the likelihood of an RPC timeout,
it does not prevent all timeouts, especially if network bandwidth is
low, or link latency is high.

Since sending the bitmap is a function of its size, link bandwidth
and link latency, your attempt to emulate replication over a
long-distance remote connection indicates to me that the RPC transfers
are still timing out.
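As a rough sketch of that reasoning (the bandwidth and latency figures
below are illustrative assumptions, not measured values), the time to
move the bitmap is serialization time plus round trips, and it is the
latency term that dominates on a long-haul link:

```python
# Illustrative estimate of whether a bitmap transfer fits inside
# the RPC timeout. Bandwidth and latency values are assumptions.
def transfer_time_s(bitmap_bytes, bandwidth_bps, rtt_s, rpc_calls=1):
    # serialization time plus one round trip per RPC call
    return bitmap_bytes * 8 / bandwidth_bps + rpc_calls * rtt_s

RPC_TIMEOUT_S = 16  # default rdc_rpc_tmout (0x10)

# e.g. a 5.5KB bitmap on a slow, high-latency emulated WAN link
t = transfer_time_s(5.5 * 1024, bandwidth_bps=64_000, rtt_s=2.0)
print(f"~{t:.1f}s vs {RPC_TIMEOUT_S}s timeout")
```

On a lossy emulated WAN, retransmissions multiply the latency term,
which is how even a small bitmap can exceed the 16-second timeout.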

I just wonder why SNDR doesn't log any information about such timeouts?

I am not sure why the timeout is not logged, but it likely has to do with the fact that this operation, the OR'ing of bitmaps, is an initialization function, not a data path function within SNDR.

Is this designed for a special purpose?

If you mean SNDR, no. SNDR is storage, filesystem and database agnostic.
The default value for the RPC timeout is 16 seconds (0x10 hex). To
change this, adjust the value of rdc_rpc_tmout in the file
/etc/system:

        set rdc:rdc_rpc_tmout=0x20

You can make sure the change has been made by running:

        echo "::rdc" | mdb -k | grep RPC

Be forewarned, there is an SNDR runtime consequence to increasing this
value: if some other network or system issue causes the SNDR secondary
node to stop responding in a timely manner, a larger timeout value
means it will take SNDR longer to detect those types of failures.

I set it to 50 and replication started to work. I'm testing now.

Good. I kind of figured that this was the problem. What is your SNDR primary volume size?

> Full replication nevertheless works in this case, so there are
> absolutely no problems with network I assume.

Full replication, versus update replication, does not require the
exchange of bitmaps from secondary to primary. Full replication makes
the determination that all primary volume blocks differ from the
secondary volume blocks, and sets the bitmap volume to all 1's. Update
replication OR's together the changed bitmaps from both the SNDR
primary and secondary nodes, which results in replicating only those
blocks that have changed.

OK, that's clear. Full replication just rewrites the volume.
But where is the requested bitmap stored - only in memory?

Sorry, a little slip-up on terminology.

On an SNDR full copy operation, it sets the in memory bitmap to all 1's.
On an SNDR update copy operation, it OR's together the in memory primary bitmap, with the recently obtained secondary bitmap.
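The difference between the two copy types can be sketched as a
simplified model (bitmaps modeled as integer bit vectors; this is not
SNDR's actual in-kernel data structure):

```python
# Simplified model of the two copy operations described above.
# Each bit covers one 32KB chunk of the primary volume.

def full_copy_bitmap(nbits):
    # full copy: mark every chunk as differing (all 1's)
    return (1 << nbits) - 1

def update_copy_bitmap(primary_bits, secondary_bits):
    # update copy: OR the primary's changed-chunk bits with the
    # bitmap fetched from the secondary, so only chunks changed
    # on either side get replicated
    return primary_bits | secondary_bits

primary = 0b00101    # chunks 0 and 2 changed on the primary
secondary = 0b10001  # chunks 0 and 4 changed on the secondary
print(bin(update_copy_bitmap(primary, secondary)))  # 0b10101
```

This is why update copy is so much cheaper than full copy: the work is
proportional to the set bits, not to the volume size.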


Another question: if the link was down for a long time and blocks changed a couple of times during the downtime - how does SNDR replicate sequenced writes?

It doesn't.

SNDR has three modes of operation:
        logging mode
        (re)synchronization mode
        replicating mode

In logging mode and replicating mode, SNDR keeps both the SNDR primary and secondary volumes in write-consistent (sequenced) order.

During (re)synchronization mode, SNDR updates the secondary volume in block (actually bitmap) order. There is only a single bit used to track differences between the primary and secondary volumes, and one bit covers 32KB.
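A (re)synchronization pass can be sketched like this (a simplified
model; `copy_chunk` is a hypothetical callback standing in for the
actual block copy):

```python
CHUNK = 32 * 1024  # one bitmap bit covers 32KB of volume

def resync(bitmap_bits, nbits, copy_chunk):
    # Walk the bitmap in order and copy every chunk whose bit is
    # set. Writes land in bitmap order, not in the order the
    # application originally issued them - hence no write-order
    # consistency during resync.
    for i in range(nbits):
        if bitmap_bits >> i & 1:
            copy_chunk(i * CHUNK, CHUNK)  # (offset, length)

copied = []
resync(0b10101, 5, lambda off, length: copied.append(off))
print(copied)  # offsets of chunks 0, 2 and 4
```

Note that a chunk rewritten many times during the outage is still
copied exactly once - the single bit carries no history.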

If the replication environment, volume content, or application availability requires concern that an SNDR replica is not write-order consistent during (re)synchronization mode, SNDR supports an option called ndr_ii, which takes an automatic compact dependent snapshot of the write-order-consistent SNDR secondary volume. In the unlikely case that (re)synchronization fails and the SNDR primary volume is lost, the snapshot volume can be restored onto the SNDR secondary.

Does it store the whole write sequence, and if yes, where - even if rewrites happened to the same block?

> I tried to trace start of sndradm with truss, because there is no
> obvious reason for me why it fails. Network ok, name resolution is
> ok, rpc responds.

More than 60% of SNDR is kernel-mode code, so the user-mode trace of
sndradm that follows will not show where the real error lies. Also,
the above error should not happen as a result of "sndradm -nE", but
rather "sndradm -nu". For the update (-u) option, I would expect an
error back from an ioctl(,0x0504400) call.


Well, tracing was just a desperate attempt to find any information on why replication stalls :) But it came down to a simple tunable - thanks for the explanation, it was very detailed and useful.

You're welcome.

- Jim



Thanks Jim!


_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
