Jim Dunham wrote, On 04/24/2009 06:46 PM:
Roman,
Thanks for your help!

Actually I have some more questions. I need to make a decision on the replication mode for our storage: zfs send/receive, AVS, or even Microsoft's internal tool on the iSCSI volumes with independent ZFS snapshots on both sides. Initially AVS seemed like a good option, but I can't make it work on a 100Mb link with 8x1.36Tb volumes.

It's strange that SNDR sends the entire bitmap - what if it is for a big replicated volume, like 1.36Gb? That's more than 100000 blocks for async replication.
There will be constant timeouts on an average 100Mb link in this case.
SNDR does not replicate the bitmap volume, just the bitmap itself. There is one bit per 32KB of primary volume size, with 8 bits per byte, and 512 bytes per block. The answer for 1.36GB is just 11.04 blocks, or 5.5KB.
Of course, looking at the example below, the math for replicating TBs versus GBs is 1024 times larger. 1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) = 11161 blocks, which is the value reported below for non-disk-queue replication with SNDR.
But the dsbitmap shows 100441 blocks for async replication, I'm missing something?
Yes, you're missing the words "disk queue". When replicating with a disk queue, there is an additional requirement to store a 1-byte or 4-byte reference counter per bit. These reference counters are separate from the actual bitmap, and are not exchanged between the SNDR primary and secondary nodes.
Required bitmap volume size:
  Sync replication: 11161 blocks
  Async replication with memory queue: 11161 blocks
  Async replication with disk queue: 100441 blocks
  Async replication with disk queue and 32 bit refcount: 368281 blocks
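The sizing rules described above (one bit per 32KB segment, plus an optional per-bit reference counter for disk queues) can be sketched as a quick back-of-the-envelope check. This is only a rough model, not dsbitmap's actual algorithm: SNDR's exact rounding and metadata overhead are not captured, so the results land close to, but not exactly on, the dsbitmap figures shown.

```python
# Rough sketch of SNDR bitmap sizing (assumption: 1 bit per 32 KB segment
# of the primary volume, plus a 1- or 4-byte reference counter per bit
# when a disk queue is used; real dsbitmap output differs slightly).
SEG = 32 * 1024   # bytes tracked per bitmap bit
BLK = 512         # bytes per disk block

def bitmap_blocks(vol_bytes, refcount_bytes=0):
    bits = -(-vol_bytes // SEG)      # one bit per 32 KB segment, rounded up
    bitmap = -(-bits // 8)           # bits packed 8 per byte
    total = bitmap + bits * refcount_bytes
    return -(-total // BLK)          # 512-byte blocks, rounded up

vol = int(1.36 * 2**40)              # ~1.36 TB primary volume
print(bitmap_blocks(vol))            # sync / memory queue: ~11100 blocks
print(bitmap_blocks(vol, 1))         # disk queue, 8-bit refcount: ~100000 blocks
print(bitmap_blocks(vol, 4))         # disk queue, 32-bit refcount: ~368000 blocks
```

The three results come out within a fraction of a percent of the 11161, 100441, and 368281 blocks dsbitmap reports.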
Good. I kind of figured that this was the problem. What is your SNDR primary volume size?
After initial sync started to work (although it's a very slow process and it takes 10-15 mins to complete) I have the following situation:
Below you made reference to using 'ping' with a result of "64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be interesting to know the results of "ping -s mtl2.flt2 8192", where 8192 is the chunk size for exchanging bitmaps.
The same result with this packet size on both sides.
The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7 KB/sec. With 8 * 11161 blocks * 512 bytes, it would take ~25 minutes to exchange bitmaps at the level of link latency and constrained bandwidth you are testing with.
It should have at least the same speed as the replication itself: 2.5-3 Mbit/s. But definitely the latency causes the problem; with low latency it initializes very quickly.
1. Storage (8x1.36Tb in one raidz2 pool)
[email protected]# sndradm -i

tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
.................
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
When you state the "initial sync" takes 10-15 minutes to complete, how did you measure this 10-15 minutes?

Do you know that when using I/O consistency groups, one can also manage all the replicas in a group with a single "-g" group command like:

sndradm -g zfs-pool -nu
Yes, I used the I/O group for replication everywhere.
2. Bitmaps are on a mirrored metadevice; they are bigger than you mentioned, but this is what dsbitmap shows for the volumes:
Not an issue, SNDR ignores what it does not need.
bmp0: Soft Partition
    Device: d100
    State: Okay
    Size: 100441 blocks (49 MB)
    Extent          Start Block         Block count
         0                   34              100441

3. Network:
tor2.flt2 <--> freebsd router <--> mtl2.flt2

4. Latency:

[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms

I'm emulating on FreeBSD the actual delay and speed of the real circuit, which is 100Mb and 16ms.
See comments above.
5. The queue during writes at a speed of 40Mbit/s on the main host:

[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
    async_block_hwm                 1402834
    .................
    async_block_hwm                 1402834
SNDR's memory and disk queues are adjustable. The two commands are:

sndradm [opts] -F <maxqfbas> [set]    set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]   set maximum writes (items) to queue

These commands set the high-water marks for both the number of blocks and the number of items in the memory queue. These are high-water marks, not hard stops, so it is possible for SNDR to exceed these values based on current in-progress I/Os.

    maxqfbas                            25000000
    maxqitems                           16384

I see that maxqfbas has been set, but you also need to increase maxqitems.

Be forewarned that guessing a value could have serious memory implications. The value of 25000000 is in 512-byte blocks. This means that you are allowing SNDR to queue up approximately 12GB of kernel memory to hold unreplicated data. You may have this much uncommitted memory sitting idle, but if not, as SNDR tries to consume this memory it will impose some serious memory demands on your system. Fortunately you didn't guess a maxqitems value too.
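The ~12GB figure follows directly from the block size. A quick check (assuming, as stated above, that maxqfbas counts 512-byte blocks of unreplicated data):

```python
# Back-of-the-envelope memory footprint of a full SNDR async memory queue,
# assuming maxqfbas counts 512-byte blocks of unreplicated data.
BLK = 512
maxqfbas = 25_000_000
queued_gib = maxqfbas * BLK / 2**30
print(round(queued_gib, 1))   # ~11.9 GiB, i.e. roughly the 12GB noted above
```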
I managed to hang the server with too big a queue. It didn't respond to any queries or key presses, but when SNDR went into logging mode after I unplugged the network, it returned to life :)
As shown with either "dsstat -m sndr -d q" or with kstat data, these values are monitored with runtime counters. The low value of maxqitems is likely the item causing the slow performance issues below. Whenever the replica hits either of these limits, back pressure (flow control) is imposed on the application or data services writing to the SNDR primary volumes.
    async_block_hwm                 1402834
    async_item_hwm                  16417
    async_queue_blocks              1390128
    async_queue_items               16382

In addition to adjusting these memory queue limits to more reasonable values, the next option is to increase the number of asynchronous flusher threads from the default of 2. It is applications, filesystems, or databases that fill these SNDR memory or disk queues; the asynchronous flusher threads empty them. Of course, more threads emptying the queues increases SNDR's memory, network, and CPU demands to perform replication.

sndradm [opts] -A <asyncthr> [set]    set the number of asynchronous threads

NOTE: If replicating based on I/O consistency groups, increasing one replica's thread count increases it across all members of the group.

The method for increasing this value is to keep doubling the current value until there is no measured improvement in replication performance, then decrease this new value by half. If replication demands still force memory-queue-full conditions, the next option is to switch from memory to disk queues. Of course, if on average the amount of change caused by write I/Os exceeds the average network replication rate, then neither memory queues nor disk queues will help.
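The doubling procedure above can be sketched as a simple loop. This is illustrative only: measure() stands in for observing the replication rate (e.g. with "dsstat -m sndr" or kstat counters) after setting the thread count via "sndradm -A", and the 5% improvement threshold is an arbitrary choice.

```python
# Illustrative sketch of the tuning loop described above: double the
# asynchronous flusher thread count while measured throughput keeps
# improving, and stop at the last value that gave a real gain.
def tune_threads(measure, start=2, limit=64):
    threads, best = start, measure(start)
    while threads * 2 <= limit:
        rate = measure(threads * 2)
        if rate <= best * 1.05:       # no meaningful improvement: stop
            break
        threads, best = threads * 2, rate
    return threads

# Example with a made-up throughput curve (MB/s) that plateaus at 8 threads:
curve = {2: 10.0, 4: 18.0, 8: 24.0, 16: 24.5, 32: 24.5, 64: 24.5}
print(tune_threads(curve.get))        # -> 8
```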
That's what I figured out today. The queue, no matter how big, filled up and the initiators just stopped working (Windows completely; Linux continued to work after the queue drained). I played with the mentioned options, but definitely the problem was not related to queue tuning; I just need another type of replication.

The speed increased a little when I changed the number of threads to 4, but the queue still filled up. On the 100Mbit link I got a 25Mbit/s synchronization speed (replication mode).
At this point time-fixed replication is your next option. This is where a full or incremental snapshot is taken at some interval, and then just the changes between snapshots are replicated. This is similar to what ZFS does with snapshot, send, and receive, but with SNDR it is at the block level, whereas with ZFS it's at the filesystem level.

AVS is both SNDR (remote replication) and II (point in time copy, or snapshots), and the two data services are integrated so that incremental replication of snapshots is easy.
Well, I'll try this II in the near future and post results here, in case somebody else is interested. And I wanted to ask if there is other software from Sun that can be used for such a task - replication over high-latency links?

Roman
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
