Thanks for your help!

Actually, I have some more questions. I need to decide on a replication mode for our storage: ZFS send/receive, AVS, or even a Microsoft internal tool on the iSCSI volumes with independent ZFS snapshots on both sides. Initially AVS seemed a good option, but I can't make it work on a 100Mb link with 8x1.36Tb volumes.

Roman,

> A weird issue:
> 1. avs works for connections on a local switch via a local freebsd router connected to the switch:
>    host1 -> switch -> router freebsd -> switch -> host2
> 2. When trying to emulate replication using a far-distance remote connection, with the freebsd router on the remote side, AVS fails with the error:
> sndradm: warning: SNDR: Recovery bitmaps not allocated

First of all, what version of AVS / OpenSolaris are you running? I ask because this error message returned from "sndradm" reflects a problem that was partially resolved in AVS on Solaris 10 and in the AVS bundled with OpenSolaris.

# sndradm -v
Remote Mirror version 11.11
# uname -a
SunOS tor.flt 5.11 snv_101b i86pc i386 i86pc Solaris

The specific issue at hand is that during the first stages of an "sndradm -u ..." update command, the SNDR secondary node is asked to send its entire bitmap to the SNDR primary node. The operation is done via a Solaris RPC call, which has an associated timeout value. If the time it takes to send this data over the network from the secondary node to the primary node exceeds the RPC timeout, the operation fails with "Recovery bitmaps not allocated".

It's strange that SNDR sends the entire bitmap - what about big replicated volumes like 1.36Tb? That's more than 100000 blocks for async replication.
There will be constant timeouts on an average 100Mb link in this case.

SNDR does not replicate the bitmap volume, just the bitmap itself. There is one bit per 32KB of primary volume size, with 8 bits per byte and 512 bytes per block. For a 1.36Tb volume that works out to roughly 11,150 blocks, or about 5.5MB.
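The per-bit arithmetic can be sketched with plain shell math. This is a rough check only; dsbitmap(1M) is the authoritative tool and adds its own header/refcount overhead on top of this raw figure, and the 2.5 Mbit/s figure is the effective replication throughput observed elsewhere in this thread:

```shell
# Raw SNDR bitmap-size estimate for a 1.36 TiB primary volume
# (illustrative only; dsbitmap(1M) is authoritative).
vol_bytes=$(( 136 * 1099511627776 / 100 ))   # 1.36 TiB in bytes
bits=$(( vol_bytes / 32768 ))                # one bit per 32KB of volume
bytes=$(( bits / 8 ))                        # 8 bits per byte
blocks=$(( bytes / 512 ))                    # 512-byte disk blocks
echo "$blocks blocks (~$(( bytes / 1048576 )) MB)"
# How long shipping that bitmap takes at an assumed 2.5 Mbit/s:
echo "transfer ~$(( bytes / (2500000 / 8) )) s"
```

Which lands close to the 11161-block sync figure dsbitmap reports in this thread, and suggests the bitmap transfer alone can take on the order of 18 seconds over the constrained link.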

But dsbitmap shows 100441 blocks for async replication. Am I missing something?

Required bitmap volume size:
 Sync replication: 11161 blocks
 Async replication with memory queue: 11161 blocks
 Async replication with disk queue: 100441 blocks
 Async replication with disk queue and 32 bit refcount: 368281 blocks

Good. I kind of figured that this was the problem. What is your SNDR primary volume size?

After the initial sync started to work (although it's a very slow process and takes 10-15 minutes to complete), I have the following situation:

1. Storage (8x1.36Tb in one raidz2 pool):
[email protected]# sndradm -i

tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 mtl2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 mtl2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 mtl2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 mtl2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 mtl2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 mtl2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool

2. The bitmaps are on a mirrored metadevice. They are bigger than you mentioned, but this is what metastat shows for the bitmap soft partitions:

bmp0: Soft Partition
   Device: d100
   State: Okay
   Size: 100441 blocks (49 MB)
   Extent              Start Block              Block count
        0                       34                   100441

3. Network:
tor2.flt2 <--> freebsd router <--> mtl2.flt2

4. Latency:

[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms

I'm emulating on FreeBSD the delay and speed of the real circuit, which is 100Mb with 16 ms latency.
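For reference, this kind of link emulation is typically done with FreeBSD's dummynet. A minimal sketch, assuming ipfw is available; the rule numbers and the decision to shape all IP traffic are illustrative, not taken from Roman's actual setup:

```shell
# Load dummynet and shape forwarded traffic to ~100 Mbit/s with 8 ms of
# one-way delay in each direction (16 ms round trip, matching the ping).
kldload dummynet 2>/dev/null
ipfw pipe 1 config bw 100Mbit/s delay 8
ipfw pipe 2 config bw 100Mbit/s delay 8
ipfw add 100 pipe 1 ip from any to any in
ipfw add 200 pipe 2 ip from any to any out
```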

5. The queue during writes at about 40 Mbit/s on the primary host:

[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834
   async_block_hwm                 1402834

The problems:

1.
In replication mode, data transmission through the freebsd router is only 2.5 Mbit/s for the replication (RPC) traffic, which is far below the numbers netio shows:

----------------- Real link, 100Mb; 16 ms delay -------------
TCP connection established.
Packet size  1k bytes:  3239 KByte/s Tx,  5885 KByte/s Rx.
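Converting those netio figures (KByte/s, assuming 1K = 1024 bytes) to Mbit/s makes the gap concrete: the link itself is delivering well over 20 Mbit/s in each direction, an order of magnitude above the 2.5 Mbit/s the replication traffic achieves.

```shell
# Convert the netio KByte/s readings above to Mbit/s for comparison.
awk 'BEGIN { printf "Tx: %.1f Mbit/s  Rx: %.1f Mbit/s\n", \
             3239*1024*8/1000000, 5885*1024*8/1000000 }'
```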

2.
When the initial synchronization (sndradm -nu) happens, its traffic is almost zero. But with a gig connection on the switch, the sync is pretty fast: maybe dozens of seconds instead of minutes.

3.
But the real problem is that iSCSI initiator writes stall because of the SNDR replication. The Windows initiator just hangs and can only be reset by deleting the target on the server. And async_block_hwm is very high while there is writing to the pool, and it gets stuck:
async_block_hwm                 1402834

[email protected]:/export/home/roman/zfs# zpool  iostat 5
              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0  3.27K  2.80K
zstor       8.99G  10.9T      0      4     35   451K
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0    499      0  61.9M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     21      0  2.67M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     18      0  2.37M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     25      0  3.24M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     18      0  2.27M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     26      0  3.37M
----------  -----  -----  -----  -----  -----  -----

zpool iostat shows some writes, but the initiator on the Windows box does not respond at this time.

And this is what kstat shows:

[email protected]:# kstat sndr:0:setinfo
module: sndr    instance: 0    name: setinfo    class: storedge
   async_block_hwm         1402834
   async_item_hwm          16417
   async_queue_blocks      *1390128*
   async_queue_items       16382
   async_queue_type        memory
   async_throttle_delay    8135137
   autosync                0
   bitmap                  /dev/md/rdsk/bmp0
   bitsset                 2674
   bmpflags                0
   bmp_size                5713920
   crtime                  11243.852414994
   disk_status             0
   flags                   2054
   if_down                 0
   if_rpc_version          7
   maxqfbas                25000000
   maxqitems               16384
   primary_host            tor2.flt2
   primary_vol             /dev/rdsk/c3t0d0s0
   secondary_host          mtl2.flt2
   secondary_vol           /dev/rdsk/c3t0d0s0
   snaptime                76822.27391977
   syncflags               0
   syncpos                 2925489600
   type_flag               5
   volsize                 2925489887
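To put those queue counters in perspective, a rough sketch of what the backlog amounts to, assuming 512-byte blocks and the ~2.5 Mbit/s effective replication throughput mentioned earlier in the thread:

```shell
# Size of the async memory-queue backlog and a rough drain-time estimate
# (illustrative arithmetic only).
queue_blocks=1390128                        # async_queue_blocks from kstat
backlog=$(( queue_blocks * 512 ))           # bytes queued for the secondary
echo "backlog ~$(( backlog / 1048576 )) MB"
echo "drain at 2.5 Mbit/s: ~$(( backlog / (2500000 / 8) / 60 )) minutes"
```

Hundreds of megabytes queued against a link that drains them over tens of minutes is consistent with the initiator stalls described above.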

So, the question for me now is whether AVS is able to handle bursts of writes and replicate the changes over a slow, high-latency link to the other side without affecting performance on the primary host. It should also keep tracking changes during long periods when the link is down.

Is it still possible to use AVS with two hosts containing 8x1.36Tb volumes, when the maximum write speed is about 30-40 Mb/s and the circuit is a 100Mb link with 15-20 ms latency? In async mode, obviously.

Another question: if the link was down for a long time and blocks changed a couple of times during the downtime, how does SNDR replicate the sequenced writes?
It doesn't.

SNDR has three modes of operation:
logging mode
(re)synchronization mode
replicating mode

In logging mode and replicating mode, SNDR keeps both the SNDR primary and secondary volumes in write-consistent (sequenced) order.

During (re)synchronization mode, SNDR updates the secondary volume in a block (actually bitmap) order. There is only a single bit used to track differences between the primary and secondary volumes, and one bit is equal to 32KB.
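As a rough illustration, the bitsset counter in the kstat output earlier gives a feel for how much data a resync would move, since each set bit marks one 32KB chunk that must be copied (a sketch only):

```shell
# Data moved by a resync, given the bitsset value from the kstat output.
bitsset=2674
resync_bytes=$(( bitsset * 32768 ))   # one set bit = one 32KB chunk
echo "~$(( resync_bytes / 1048576 )) MB to resynchronize"
```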

If, given the replication environment, volume content, and application availability, it is a concern that an SNDR replica is not write-order consistent during (re)synchronization mode, SNDR supports an option called ndr_ii. It takes an automatic compact dependent snapshot of the write-order-consistent SNDR secondary volume, and in the unlikely case that (re)synchronization fails and the SNDR primary volume is lost, the snapshot volume can be restored on the SNDR secondary.
Does that mean that during (re)synchronization mode the volumes shouldn't be mounted, because the write order is not consistent?

Roman


_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
