Roman,

Thanks for your help!

Actually I have some more questions. I need to make a decision on the replication mode for our storage: zfs send-receive, avs, or even a Microsoft internal tool on the iSCSI volumes with independent zfs snapshots on both sides. Initially avs seemed to me a good option, but I can't make it work on a 100Mb link with 8x1.36Tb volumes.

It's strange that sndr sends the entire bitmap - what if one is for a big replicated volume, like 1.36Gb? It's more than 100000 blocks for async replication.
There will be constant timeouts on an average 100Mb link in this case.

SNDR does not replicate the bitmap volume, just the bitmap itself. There is one bit per 32KB of primary volume size, with 8 bits per byte, and 512 bytes per block. The answer for 1.36GB is just 11.04 blocks, or 5.5KB.

Of course, looking at the example below, the math for replicating TBs versus GBs is 1024 times larger.

1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) = 11161 blocks, which is the value reported below for non-disk-queue replication with SNDR.

But dsbitmap shows 100441 blocks for async replication. Am I missing something?

Yes, you missed the words "disk queue". When replicating with a disk queue, there is an additional requirement to store a 1-byte or 4-byte reference counter per bit. These reference counters are separate from the actual bitmap and are not exchanged between the SNDR primary and secondary nodes.

Required bitmap volume size:
  Sync replication: 11161 blocks
  Async replication with memory queue: 11161 blocks
  Async replication with disk queue: 100441 blocks
  Async replication with disk queue and 32 bit refcount: 368281 blocks
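
For reference, the disk queue numbers follow from the same formula plus the per-bit reference counters. Using the bmp_size of 5713920 bytes reported further down in your kstat output (about 45,711,360 bits at one bit per 32KB), and treating the one extra block as rounding/metadata:

  bitmap:            45,711,360 bits / 8 / 512    =  11,160 blocks (reported as 11161)
  1-byte refcounts:  45,711,360 bytes / 512       =  89,280 blocks ->  11,161 +  89,280 = 100,441 blocks
  4-byte refcounts:  45,711,360 * 4 bytes / 512   = 357,120 blocks ->  11,161 + 357,120 = 368,281 blocks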


Good. I kind of figured that this was the problem. What is your SNDR primary volume size?
After the initial sync started to work (although it's a very slow process and takes 10-15 minutes to complete), I have the following situation:

Below you made reference to using 'ping' with a result of "64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be interesting to know the results of "ping -s mtl2.flt2 8192", where 8192 is the chunk size for exchanging bitmaps.

The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7 KB/sec. At 11161 blocks * 512 bytes per bitmap, it would take ~25 minutes per volume (and you have 8 of them) to exchange bitmaps with the level of link latency and constrained bandwidth you are testing with.
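
As a rough worked example, assuming the small-packet rate above is representative of the bitmap exchange:

  11161 blocks * 512 bytes  ~= 5.7 MB per volume bitmap
  5.7 MB / 3.7 KB/sec       ~= 1540 sec, or roughly 25 minutes per volume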

1. Storage (8x1.36Tb in one raidz2 pool):
[email protected]# sndradm -i

tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 mtl2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 mtl2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 mtl2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 mtl2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 mtl2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 mtl2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool

When you state the "initial sync" takes 10-15 minutes to complete, what did you do to measure this 10-15 minutes?
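
One way to measure it, rather than watching the clock, is to sample the per-set counters while the resync runs, e.g. with dsstat at a 5-second interval (the interval value is just an illustration):

    dsstat -m sndr 5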

Do you know that when using I/O consistency groups, one can also manage all the replicas in the group with a single "-g" group command like:

        sndradm -g zfs-pool -nu

2. Bitmaps are on a mirrored metadevice; they are bigger than you mentioned, but this is what dsbitmap shows for the volumes:

Not an issue, SNDR ignores what it does not need.

bmp0: Soft Partition
    Device: d100
    State: Okay
    Size: 100441 blocks (49 MB)
    Extent              Start Block              Block count
         0                       34                   100441

3. Network:
tor2.flt2 <--> freebsd router <--> mtl2.flt2

4. Latency:

[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms

I'm emulating on FreeBSD the actual delay and speed of the real circuit, which is 100Mb and 16ms.

See comments above.

5. The queue during writes at a speed of 40Mbit/s on the main host:

[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834
    async_block_hwm                 1402834

SNDR's memory and disk queues are adjustable. The two commands are:

sndradm [opts] -F <maxqfbas> [set]     set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]    set maximum writes (items) to queue

These commands set the high-water marks for both the number of blocks and the number of items in the memory queue. These are high-water marks, not hard stops, so it is possible for SNDR to exceed these values based on currently in-progress I/Os.

    maxqfbas                            25000000
    maxqitems                           16384

I see that maxqfbas has been set, but you also need to increase maxqitems.
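
As an illustration only (assuming the group form shown earlier also applies to these tunables, and using the usage placeholders rather than suggesting specific values), both limits can be set group-wide:

        sndradm -g zfs-pool -W <maxwrites>
        sndradm -g zfs-pool -F <maxqfbas>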

Be forewarned that guessing a value could have serious memory implications. The value of 25000000 is in blocks of 512 bytes. This means that you are allowing SNDR to queue up approximately 12GB of kernel memory to hold unreplicated data. You may have this much uncommitted memory sitting idle, but if not, as SNDR tries to consume this memory it will impose some serious memory demands on your system. Fortunately you didn't guess a maxqitems value too.
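
For reference, the arithmetic behind that figure:

  25,000,000 blocks * 512 bytes/block = 12,800,000,000 bytes, or roughly 12GB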

As shown with either "dsstat -m sndr -d q" or with kstat data, these values are monitored with runtime counters. The low value of maxqitems is likely the item causing the slow performance issues below. Whenever the replica hits either of these limits, back pressure (flow control) is imposed on the application or data services writing to the SNDR primary volumes.

    async_block_hwm                 1402834
    async_item_hwm                  16417
    async_queue_blocks              1390128
    async_queue_items               16382
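
To watch those counters continuously during a write burst, the queue display can be sampled at an interval, for example (the 5-second interval is illustrative):

    dsstat -m sndr -d q 5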

In addition to adjusting these memory queue limits to more reasonable values, the next option is to increase the number of asynchronous flusher threads from the default of 2. It is applications, filesystems or databases that fill these SNDR memory or disk queues; the asynchronous flusher threads empty them. Of course, more threads emptying the queues increases SNDR's memory, network and CPU demands to perform replication.

sndradm [opts] -A <asyncthr> [set] set the number of asynchronous threads

NOTE: If replicating based on I/O consistency groups, increasing one replica's thread count increases it across all members of the group.

The method for tuning this value is to keep doubling the current value until there is no measured improvement in replication performance, then step back down to half of that last value.
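
For example, starting from the default of 2 and using the group name from above (the progression is illustrative), measuring replication throughput between each step:

        sndradm -g zfs-pool -A 4
        sndradm -g zfs-pool -A 8
        sndradm -g zfs-pool -A 16

If the last doubling shows no improvement, drop back to half of that value.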

If it is determined that replication demands still force memory queue full conditions, the next option is to switch from memory to disk queues. Of course, if the average amount of change caused by write I/Os exceeds the average network replication rate, then neither memory queues nor disk queues will help.
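
If you do go the disk queue route, my recollection is that a queue volume is attached with the -q option, something along these lines (the queue volume path is just a placeholder, and the exact syntax should be verified against the sndradm man page):

        sndradm -g zfs-pool -q a /dev/md/rdsk/<diskqueue>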

At this point time-fixed replication is your next option. This is where a full or incremental snapshot is taken at some interval, and then just the changes between snapshots are replicated. This is similar to what ZFS does with snapshot, send and receive, but with SNDR it is at the block level, whereas with ZFS it is at the filesystem level.
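
For comparison, the ZFS version of this would look roughly like the following sketch (the snapshot names are made up; zstor and mtl2.flt2 are the pool and host names from your mail):

    zfs snapshot zstor@snap1
    # ... application writes happen ...
    zfs snapshot zstor@snap2
    zfs send -i zstor@snap1 zstor@snap2 | ssh mtl2.flt2 zfs receive -F zstor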

AVS is both SNDR (remote replication) and II (point in time copy, or snapshots), and the two data services are integrated so that incremental replication of snapshots is easy.
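
As a rough sketch of the II side (volume names are placeholders, and the exact form should be checked against the iiadm man page), a dependent shadow of a volume is enabled with master, shadow and bitmap volumes:

        iiadm -e dep /dev/rdsk/c3t0d0s0 /dev/rdsk/<shadow> /dev/rdsk/<ii-bitmap>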

The problems:

The problems below are likely due to the too-low setting of maxqitems at 16384 items, compounded by running with the default of two asynchronous flusher threads.

- Jim




1.
In replication mode, data transmission on FreeBSD is only 2.5 Mbit/s for the rpc traffic, which is quite a bit lower than the numbers netio shows:

----------------- Real link, 100Mb; 16ms delay -------------
TCP connection established.
Packet size  1k bytes:  3239 KByte/s Tx,  5885 KByte/s Rx.

2.
When the initial synchronization (sndradm -nu) happens, its traffic is almost zero. But if it's a gig connection on the switch, then the sync is pretty fast, maybe dozens of seconds instead of minutes.

3.
But the real problem is that iSCSI initiator writes stall because of sndr replication. The Windows initiator just hangs and can only be reset by deleting the target on the server. And async_block_hwm is very high when there is writing on the pool, and it gets stuck:
async_block_hwm                 1402834

[email protected]:/export/home/roman/zfs# zpool  iostat 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0  3.27K  2.80K
zstor       8.99G  10.9T      0      4     35   451K
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0    499      0  61.9M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     21      0  2.67M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     18      0  2.37M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     25      0  3.24M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     18      0  2.27M
----------  -----  -----  -----  -----  -----  -----
rpool       10.6G  63.4G      0      0      0      0
zstor       8.99G  10.9T      0     26      0  3.37M
----------  -----  -----  -----  -----  -----  -----

zpool iostat shows some writes, but the initiator doesn't respond on the Windows box at this time.

And this is what kstat shows:

[email protected]:# kstat sndr:0:setinfo
module: sndr                            instance: 0
name:   setinfo                         class:    storedge
    async_block_hwm                 1402834
    async_item_hwm                  16417
    async_queue_blocks              1390128
    async_queue_items               16382
    async_queue_type                memory
    async_throttle_delay            8135137
    autosync                                0
    bitmap                                  /dev/md/rdsk/bmp0
    bitsset                                 2674
    bmpflags                                0
    bmp_size                            5713920
    crtime                                  11243.852414994
    disk_status                         0
    flags                                   2054
    if_down                             0
    if_rpc_version                      7
    maxqfbas                            25000000
    maxqitems                           16384
    primary_host                    tor2.flt2
    primary_vol                     /dev/rdsk/c3t0d0s0
    secondary_host                  mtl2.flt2
    secondary_vol                   /dev/rdsk/c3t0d0s0
    snaptime                        76822.27391977
    syncflags                       0
    syncpos                         2925489600
    type_flag                       5
    volsize                         2925489887

So, the question for me now is whether avs is able to handle burst writes and replicate changes over the slow, high-latency link to the other side without affecting performance on the primary host? And it should track changes during long periods when the link is down.

Is it still possible to use avs with 2 hosts containing 8x1.36Tb volumes, where the max write speed is about 30-40 Mb/s and the circuit is a 100Mb link with 15-20 ms latency? In async mode, obviously.

Another question: if the link was down for a long time and blocks have changed a couple of times during the downtime - how does sndr replicate sequenced writes?
It doesn't.

SNDR has three modes of operation:
 logging mode
 (re)synchronization mode
 replicating mode

In logging mode and replicating mode, SNDR keeps both the SNDR primary and secondary volumes in write-consistent (sequenced) order.

During (re)synchronization mode, SNDR updates the secondary volume in block (actually bitmap) order. There is only a single bit used to track differences between the primary and secondary volumes, and one bit is equal to 32KB.

If the replication environment, volume content and application availability need to be concerned that an SNDR replica is not write-order consistent during (re)synchronization mode, SNDR supports an option called ndr_ii, which takes an automatic compact dependent snapshot of the write-order consistent SNDR secondary volume; in the unlikely case that (re)synchronization mode fails and the SNDR primary volume is lost, the snapshot volume can be restored on the SNDR secondary.
Does this mean that during (re)synchronization mode the volumes shouldn't be mounted, because the write order is not consistent?

Roman