Jim Dunham wrote, On 04/24/2009 06:46 PM:
Roman,
Thanks for your help!

Actually I have some more questions. I need to make a decision on the replication mode for our storage: zfs send/receive, AVS, or even Microsoft's internal tool on the iSCSI volumes with independent ZFS snapshots on both sides. Initially AVS seemed like a good option, but I can't make it work on a 100Mb link with 8x1.36Tb volumes.

It's strange that SNDR sends the entire bitmap - what if it is for a big replicated volume, like 1.36Gb? That's more than 100000 blocks for async replication.
There will be constant timeouts on an average 100Mb link in this case.
SNDR does not replicate the bitmap volume, just the bitmap itself. There is one bit per 32KB of primary volume size, with 8 bits per byte, and 512 bytes per block. The answer for 1.36GB is just 11.04 blocks, or 5.5KB.
Of course, looking at the example below, the math for replicating TBs versus GBs is 1024 times larger. 1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) = 11161 blocks, which is the value reported below for non-disk-queue replication with SNDR.
But the dsbitmap shows 100441 blocks for async replication, I'm missing something?
Yes, you're missing the words "disk queue". When replicating with a disk queue, there is an additional requirement to store a 1-byte or 4-byte reference counter per bit. These reference counters are separate from the actual bitmap, and are not exchanged between the SNDR primary and secondary nodes.
Required bitmap volume size:
  Sync replication: 11161 blocks
  Async replication with memory queue: 11161 blocks
  Async replication with disk queue: 100441 blocks
  Async replication with disk queue and 32 bit refcount: 368281 blocks
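The sizing rules described above (one bit per 32KB segment, plus an optional per-bit reference counter for disk queues) can be sketched as a quick back-of-the-envelope check. This is only a rough model, not dsbitmap's actual algorithm: SNDR's exact rounding and metadata overhead are not captured, so the results land close to, but not exactly on, the dsbitmap figures shown.

```python
# Rough sketch of SNDR bitmap sizing (assumption: 1 bit per 32 KB segment
# of the primary volume, plus a 1- or 4-byte reference counter per bit
# when a disk queue is used; real dsbitmap output differs slightly).
SEG = 32 * 1024   # bytes tracked per bitmap bit
BLK = 512         # bytes per disk block

def bitmap_blocks(vol_bytes, refcount_bytes=0):
    bits = -(-vol_bytes // SEG)      # one bit per 32 KB segment, rounded up
    bitmap = -(-bits // 8)           # bits packed 8 per byte
    total = bitmap + bits * refcount_bytes
    return -(-total // BLK)          # 512-byte blocks, rounded up

vol = int(1.36 * 2**40)              # ~1.36 TB primary volume
print(bitmap_blocks(vol))            # sync / memory queue: ~11100 blocks
print(bitmap_blocks(vol, 1))         # disk queue, 8-bit refcount: ~100000 blocks
print(bitmap_blocks(vol, 4))         # disk queue, 32-bit refcount: ~368000 blocks
```

The three results come out within a fraction of a percent of the 11161, 100441, and 368281 blocks dsbitmap reports.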
Good. I kind of figured that this was the problem. What is your SNDR primary volume size?
After initial sync started to work (although it's a very slow process and it takes 10-15 mins to complete) I have the following situation:
Below you made reference to using 'ping' with a result of "64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be interesting to know the results of "ping -s mtl2.flt2 8192", where 8192 is the chunk size for exchanging bitmaps.
The same result with this packet size on both sides.
The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7 KB/sec. With 8 * 11161 blocks * 512 bytes, it would take ~25 minutes to exchange bitmaps at the level of link latency and constrained bandwidth you are testing with.
It should have at least the same speed as the replication itself: 2.5-3 Mbit/s. But definitely the latency causes the problem; with low latency it initializes very quickly.
1. Storage (8x1.36Tb in one raidz2 pool)
[email protected]# sndradm -i

tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
.................
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
When you state the "initial sync" takes 10-15 minutes to complete, how did you measure this 10-15 minutes?

Do you know that when using I/O consistency groups, one can also manage all the replicas in a group with a single "-g" group command like:

sndradm -g zfs-pool -nu
Yes, I used the I/O group for replication everywhere.
2. Bitmaps are on a mirrored metadevice; they are bigger than you mentioned, but this is what dsbitmap shows for the volumes:
Not an issue, SNDR ignores what it does not need.
bmp0: Soft Partition
    Device: d100
    State: Okay
    Size: 100441 blocks (49 MB)
    Extent          Start Block         Block count
         0                   34              100441

3. Network:
tor2.flt2 <--> freebsd router <--> mtl2.flt2

4. Latency:

[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms

I'm emulating on FreeBSD the actual delay and speed of the real circuit, which is 100Mb and 16ms.
See comments above.
5. The queue during writes at a speed of 40Mbit/s on the main host:

[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
    async_block_hwm                 1402834
    .................
    async_block_hwm                 1402834
SNDR's memory and disk queues are adjustable. The two commands are:

sndradm [opts] -F <maxqfbas> [set]    set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]   set maximum writes (items) to queue

These commands set the high-water marks for both the number of blocks and the number of items in the memory queue. These are high-water marks, not hard stops, so it is possible for SNDR to exceed these values based on current in-progress I/Os.

    maxqfbas                            25000000
    maxqitems                           16384

I see that maxqfbas has been set, but you also need to increase maxqitems.

Be forewarned that guessing a value could have serious memory implications. The value of 25000000 is in 512-byte blocks. This means that you are allowing SNDR to queue up approximately 12GB of kernel memory to hold unreplicated data. You may have this much uncommitted memory sitting idle, but if not, as SNDR tries to consume this memory it will impose some serious memory demands on your system. Fortunately you didn't guess a maxqitems value too.
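The ~12GB figure follows directly from the block size. A quick check (assuming, as stated above, that maxqfbas counts 512-byte blocks of unreplicated data):

```python
# Back-of-the-envelope memory footprint of a full SNDR async memory queue,
# assuming maxqfbas counts 512-byte blocks of unreplicated data.
BLK = 512
maxqfbas = 25_000_000
queued_gib = maxqfbas * BLK / 2**30
print(round(queued_gib, 1))   # ~11.9 GiB, i.e. roughly the 12GB noted above
```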
I managed to hang the server with too big a queue. It didn't respond to any queries or key presses, but when SNDR went into logging mode after I unplugged the network, it returned to life :)
As shown with either "dsstat -m sndr -d q" or with kstat data, these values are monitored with runtime counters. The low value of maxqitems is likely the item causing the slow performance issues below. Whenever the replica hits either of these limits, back pressure (flow control) is imposed on the application or data services writing to the SNDR primary volumes.
    async_block_hwm                 1402834
    async_item_hwm                  16417
    async_queue_blocks              1390128
    async_queue_items               16382

In addition to adjusting these memory queue limits to more reasonable values, the next option is to increase the number of asynchronous flusher threads from the default of 2. It is applications, filesystems, or databases that fill these SNDR memory or disk queues; the asynchronous flusher threads empty them. Of course, more threads emptying the queues increases SNDR's memory, network, and CPU demands to perform replication.

sndradm [opts] -A <asyncthr> [set]    set the number of asynchronous threads

NOTE: If replicating based on I/O consistency groups, increasing one replica's thread count increases it across all members of the group.

The method for increasing this value is to keep doubling the current value until there is no measured improvement in replication performance, then decrease this new value by half. If replication demands still force memory-queue-full conditions, the next option is to switch from memory to disk queues. Of course, if on average the amount of change caused by write I/Os exceeds the average network replication rate, then neither memory queues nor disk queues will help.
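The doubling procedure above can be sketched as a simple loop. This is illustrative only: measure() stands in for observing the replication rate (e.g. with "dsstat -m sndr" or kstat counters) after setting the thread count via "sndradm -A", and the 5% improvement threshold is an arbitrary choice.

```python
# Illustrative sketch of the tuning loop described above: double the
# asynchronous flusher thread count while measured throughput keeps
# improving, and stop at the last value that gave a real gain.
def tune_threads(measure, start=2, limit=64):
    threads, best = start, measure(start)
    while threads * 2 <= limit:
        rate = measure(threads * 2)
        if rate <= best * 1.05:       # no meaningful improvement: stop
            break
        threads, best = threads * 2, rate
    return threads

# Example with a made-up throughput curve (MB/s) that plateaus at 8 threads:
curve = {2: 10.0, 4: 18.0, 8: 24.0, 16: 24.5, 32: 24.5, 64: 24.5}
print(tune_threads(curve.get))        # -> 8
```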
That's what I figured out today. The queue, no matter how big, filled up and the initiators just stopped working (Windows completely; Linux continued to work after the queue drained). I played with the mentioned options, but definitely the problem was not related to queue tuning; I just need another type of replication.

The speed increased a little when I changed the number of threads to 4, but the queue still filled up. On the 100Mbit link I got a 25Mbit/s synchronization speed (replication mode).
At this point time-fixed replication is your next option. This is where a full or incremental snapshot is taken at some interval, and then just the changes between snapshots are replicated. This is similar to what ZFS does with snapshot, send, and receive, but with SNDR it is at the block level, whereas with ZFS it's at the filesystem level.

AVS is both SNDR (remote replication) and II (point in time copy, or snapshots), and the two data services are integrated so that incremental replication of snapshots is easy.
Well, I'll try this II in the near future and post results here, in case somebody else is interested. And I wanted to ask if there is other software from Sun that can be used for such a task - replication over high-latency links?

Roman
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss
