Thanks for your help!
Actually I have some more questions. I need to make a decision on the
replication mode for our storage: zfs send-receive, AVS, or even a
Microsoft-internal tool on the iSCSI volumes with independent ZFS
snapshots on both sides.
Initially AVS seemed like a good option, but I can't make it work
on a 100Mbit link with 8 x 1.36TB volumes.
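(For comparison, the zfs send-receive option would look roughly like
this; just a sketch, assuming a zvol zstor/vol under the zstor pool,
passwordless ssh between the hosts, and made-up snapshot names:)
# zfs snapshot zstor/vol@repl-2
# zfs send -i zstor/vol@repl-1 zstor/vol@repl-2 | \
    ssh mtl2.flt2 zfs receive -F zstor/vol
(The incremental stream only carries the blocks changed since the
previous snapshot.)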
Roman,
> A weird issue:
> 1. AVS works for connections on a local switch via a local FreeBSD
> router connected to the switch:
> host1 -> switch -> freebsd router -> switch -> host2
> 2. When trying to emulate replication over a long-distance remote
> connection with the FreeBSD router on the remote side, AVS fails
> with the error:
> sndradm: warning: SNDR: Recovery bitmaps not allocated
First of all, what version of AVS / OpenSolaris are you running? The
reason I ask is that this error message returned from "sndradm" was a
problem partially resolved for AVS on Solaris 10 and for the AVS
bundled with OpenSolaris.
# sndradm -v
Remote Mirror version 11.11
# uname -a
SunOS tor.flt 5.11 snv_101b i86pc i386 i86pc Solaris
The specific issue at hand is that during the first stages of an
"sndradm -u ..." update command, the SNDR secondary node is asked to
send its entire bitmap to the SNDR primary node. The operation is done
via a Solaris RPC call, which has an associated timeout value. If the
amount of time it takes to send this data over the network from the
secondary node to the primary node exceeds the RPC timeout value, the
operation fails with "Recovery bitmaps not allocated".
It's strange that SNDR sends the entire bitmap. What about a large
replicated volume like 1.36TB? dsbitmap reports more than 100,000
blocks for async replication.
There would be constant timeouts on an average 100Mbit link in that case.
SNDR does not replicate the bitmap volume, just the bitmap itself.
There is one bit per 32KB of primary volume size, with 8 bits per
byte, and 512 bytes per block. For a 1.36TB volume that works out to
roughly 11,160 blocks, or about 5.5MB.
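(Worked through with the actual volume size shown further down in the
kstat output, volsize 2925489887 blocks:
# echo '2925489887 * 512 / 32768 / 8 / 512' | bc
11159
which, allowing for a small bitmap header and rounding up, matches the
11161 blocks dsbitmap reports for sync replication below.)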
But dsbitmap shows 100441 blocks for async replication. Am I missing
something?
Required bitmap volume size:
Sync replication: 11161 blocks
Async replication with memory queue: 11161 blocks
Async replication with disk queue: 100441 blocks
Async replication with disk queue and 32 bit refcount: 368281 blocks
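(For reference, these numbers come from running dsbitmap against one
of the data volumes, something like this; -r should be the
remote-mirror sizing option, if I recall it correctly:)
# dsbitmap -r /dev/rdsk/c3t0d0s0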
Good. I kind of figured that this was the problem. What is your
SNDR primary volume size?
After the initial sync started to work (although it's a very slow
process and takes 10-15 minutes to complete), I have the following
situation:
1. Storage (8 x 1.36TB in one raidz2 pool):
[email protected]# sndradm -i
tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2
/dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 mtl2.flt2
/dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 mtl2.flt2
/dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 mtl2.flt2
/dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 mtl2.flt2
/dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 mtl2.flt2
/dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 mtl2.flt2
/dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2
/dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2
/dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
2. The bitmaps are on a mirrored metadevice. They are bigger than you
mentioned, but sized to what dsbitmap reported for these volumes (a
rough metainit sketch follows below, after point 5):
bmp0: Soft Partition
Device: d100
State: Okay
Size: 100441 blocks (49 MB)
Extent Start Block Block count
0 34 100441
3. Network:
tor2.flt2 <---> freebsd router <---> mtl2.flt2
4. Latency:
[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms
I'm emulating on FreeBSD the actual delay and speed of the real
circuit, which is 100Mbit and 16ms (a dummynet sketch follows below,
after point 5).
5. The queue during writes at 40Mbit/s on the main host:
[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
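(Regarding point 2: the bitmap soft partitions can be laid out roughly
like this; only a sketch, the submirror slices and the size are
placeholders, and the descriptive metadevice name needs a recent SVM:)
# metainit d101 1 1 c4t0d0s1
# metainit d102 1 1 c4t2d0s1
# metainit d100 -m d101
# metattach d100 d102
# metainit bmp0 -p d100 50m
(Regarding point 4: a dummynet setup that gives 100Mbit with a 16ms
round trip looks roughly like this; em0 and the rule numbers are
placeholders, and a pipe's delay is one-way, hence 8ms per direction:)
# kldload dummynet
# ipfw pipe 1 config bw 100Mbit/s delay 8
# ipfw pipe 2 config bw 100Mbit/s delay 8
# ipfw add 100 pipe 1 ip from any to any in via em0
# ipfw add 200 pipe 2 ip from any to any out via em0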
The problems:
1.
In replication mode, data transmission through the FreeBSD router is
only 2.5 Mbit/s for the replication (RPC) traffic, which is quite a bit
lower than the numbers netio shows:
----------------- Real link, 100Mbit; 16ms delay -------------
TCP connection established.
Packet size 1k bytes: 3239 KByte/s Tx, 5885 KByte/s Rx.
2.
When the initial synchronization (sndradm -nu) happens, its traffic is
almost zero.
But if it's a gigabit connection on the switch, then the sync is pretty
fast, maybe dozens of seconds instead of minutes.
3.
But the real problem is that iSCSI initiator writes stall because of
the SNDR replication. The Windows initiator just hangs and can only be
reset by deleting the target on the server.
And async_block_hwm is very high when there is writing on the pool, and
it gets stuck:
async_block_hwm 1402834
[email protected]:/export/home/roman/zfs# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 3.27K 2.80K
zstor 8.99G 10.9T 0 4 35 451K
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 499 0 61.9M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 21 0 2.67M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.37M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 25 0 3.24M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.27M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 26 0 3.37M
---------- ----- ----- ----- ----- ----- -----
zpool iostat shows some writes, but the initiator on the Windows box
doesn't respond at this time.
And this is what kstat shows:
[email protected]:# kstat sndr:0:setinfo
module: sndr instance: 0
name: setinfo class: storedge
async_block_hwm 1402834
async_item_hwm 16417
async_queue_blocks 1390128
async_queue_items 16382
async_queue_type memory
async_throttle_delay 8135137
autosync 0
bitmap /dev/md/rdsk/bmp0
bitsset 2674
bmpflags 0
bmp_size 5713920
crtime 11243.852414994
disk_status 0
flags 2054
if_down 0
if_rpc_version 7
maxqfbas 25000000
maxqitems 16384
primary_host tor2.flt2
primary_vol /dev/rdsk/c3t0d0s0
secondary_host mtl2.flt2
secondary_vol /dev/rdsk/c3t0d0s0
snaptime 76822.27391977
syncflags 0
syncpos 2925489600
type_flag 5
volsize 2925489887
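(A couple of things stand out in the kstat output above: the set is on
a memory queue (async_queue_type memory), async_queue_items 16382 is
right at the maxqitems limit of 16384, and async_throttle_delay is
large, which suggests writes on the primary are being throttled while
the queue drains over the slow link. The queue limits, the number of
async flusher threads, and the queue type can be changed with sndradm;
the option letters below are from memory, so treat this as a sketch
and check the sndradm man page:)
# sndradm -g zfs-pool -W 65536                (maxqitems)
# sndradm -g zfs-pool -F 50000000             (maxqfbas)
# sndradm -g zfs-pool -A 4                    (async threads)
# sndradm -g zfs-pool -q a /dev/md/rdsk/dq0   (switch to a disk queue; dq0 is a placeholder)
(dsstat -m sndr 5 is also handy for watching the sync percentage and
the queue while testing, again if I have the options right.)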
So, the question for me now is whether AVS is able to handle burst
writes and replicate the changes over a slow, high-latency link to the
other side without affecting performance on the primary host. It
should also track changes during long periods when the link is down.
Is it still possible to use AVS with 2 hosts containing 8 x 1.36TB
volumes, when the maximum write speed is about 30-40 Mbit/s and the
circuit is a 100Mbit link with 15-20 ms latency? In async mode,
obviously.
Another question: if the link was down for a long time and blocks have
changed a couple of times during the downtime, how does SNDR replicate
sequenced writes?
It doesn't.
SNDR has three modes of operation:
logging mode
(re)synchronization mode
replicating mode
In logging mode and replicating mode, SNDR keeps both the SNDR
primary and secondary volumes in write-consistent (sequenced) order.
During (re)synchronization mode, SNDR updates the secondary volume in
block (actually bitmap) order. There is only a single bit used to
track differences between the primary and secondary volumes, and one
bit is equal to 32KB.
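(As a concrete example, using the bitsset value from the kstat output
above, 2674 bits at 32KB per bit:
# echo '2674 * 32 / 1024' | bc
83
so a resync of that set would move roughly 83MB, but in bitmap order
rather than in the order the application originally wrote the blocks.)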
If the replication environment, volume content, and application
availability require you to be concerned that an SNDR replica is not
write-order consistent during (re)synchronization mode, SNDR supports
an option called ndr_ii, which takes an automatic compact dependent
snapshot of the write-order-consistent SNDR secondary volume. In the
unlikely case that (re)synchronization mode fails and the SNDR primary
volume is lost, the snapshot volume can be restored on the SNDR
secondary.
Does it mean that during (re)synchronization mode the secondary
volumes shouldn't be mounted, because the write order is not
consistent?
Roman