Roman,
Thanks for your help!
Actually I have some more questions. I need to make a decision on the
replication mode for our storage: zfs send-receive, avs, or even a
Microsoft internal tool on the iscsi volumes with independent zfs
snapshots on both sides.
Initially avs seemed to me a good option, but I can't make it work on
a 100Mb link with 8x1.36Tb volumes.
It's strange that sndr sends the entire bitmap - what if it is for a
big replicated volume like 1.36Gb? That's more than 100000 blocks for
async replication.
There will be constant timeouts on an average 100Mb link in this case.
SNDR does not replicate the bitmap volume, just the bitmap itself.
There is one bit per 32KB of primary volume size, with 8 bits per
byte, and 512 bytes per block. The answer for 1.36GB is just 11.04
blocks, or 5.5KB.
Of course, looking at the example below, the math for replicating TBs
versus GBs is 1024 times larger.
1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) =
11161 blocks, which is the value reported below for non-disk-queue
replication with SNDR.
But dsbitmap shows 100441 blocks for async replication - am I missing
something?
Yes you did, the words "disk queue". When replicating with a disk
queue, there is an additional requirement of storing a 1-byte or
4-byte reference counter per bit. These reference counters are
separate from the actual bitmap, and are not exchanged between the
SNDR primary and secondary nodes.
Required bitmap volume size:
Sync replication: 11161 blocks
Async replication with memory queue: 11161 blocks
Async replication with disk queue: 100441 blocks
Async replication with disk queue and 32 bit refcount: 368281 blocks
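To show where the disk queue numbers come from, here is a
back-of-the-envelope check for the same 1.36 TB volume (ignoring a
small amount of rounding and header overhead):
1.36 TB * (1 bit / 32KB) = ~45.7 million bits in the bitmap
1-byte refcount per bit: ~45.7 MB / 512 bytes per block = ~89280
blocks, plus the 11161-block bitmap = ~100441 blocks
4-byte refcount per bit: ~183 MB / 512 bytes per block = ~357120
blocks, plus the 11161-block bitmap = ~368281 blocks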
Good. I kind of figured that this was the problem. What are your
SNDR primary volume sizes?
After the initial sync started to work (although it's a very slow
process and takes 10-15 minutes to complete), I have the following
situation:
Below you made reference to using 'ping' with a result of "64 bytes
from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be
interesting to know the results of "ping -s mtl2.flt2 8192", where
8192 is the chunk size for exchanging bitmaps.
The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7 KB/
sec. With 11161 blocks * 512 bytes per bitmap, it would take ~25
minutes to exchange a bitmap with the level of link latency and
constrained bandwidth you are testing with.
1. Storage (8x1.36Tb in one raidz2 pool):
[email protected]# sndradm -i
tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/
c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 mtl2.flt2 /dev/rdsk/
c3t1d0s0 /dev/md/rdsk/bmp1 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 mtl2.flt2 /dev/rdsk/
c3t2d0s0 /dev/md/rdsk/bmp2 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 mtl2.flt2 /dev/rdsk/
c3t3d0s0 /dev/md/rdsk/bmp3 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 mtl2.flt2 /dev/rdsk/
c3t4d0s0 /dev/md/rdsk/bmp4 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 mtl2.flt2 /dev/rdsk/
c3t5d0s0 /dev/md/rdsk/bmp5 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 mtl2.flt2 /dev/rdsk/
c3t6d0s0 /dev/md/rdsk/bmp6 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/
c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/
c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
When you state the "initial sync" takes 10-15 minutes to complete,
what did you do to measure this 10-15 minutes?
Do you know that when using I/O consistency groups, one can also
manage all the replicas in a group with a single "-g" group command
like:
sndradm -g zfs-pool -nu
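For example (these are only illustrative invocations of the same "-g"
form; -l, -m, -u and -w are the usual sndradm operations, applied here
to every set in the group):
sndradm -g zfs-pool -nl    # place all sets in the group into logging mode
sndradm -g zfs-pool -nm    # start a full synchronization of the group
sndradm -g zfs-pool -nu    # start an update (re)synchronization of the group
sndradm -g zfs-pool -nw    # wait for the group's synchronization to complete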
2. Bitmaps are on a mirrored metadevice; they are bigger than you
mentioned, but this is what dsbitmap shows for the volumes:
Not an issue, SNDR ignores what it does not need.
bmp0: Soft Partition
Device: d100
State: Okay
Size: 100441 blocks (49 MB)
Extent Start Block Block count
0 34 100441
3. Network:
tor2.flt2 <---> freebsd router <---> mtl2.flt2
4. Latency:
[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms
I'm emulating on FreeBSD the actual delay and speed of the real
circuit, which is 100Mb and 16ms.
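(For reference, one common way to emulate this on FreeBSD is with
ipfw/dummynet; the pipe and rule numbers below are arbitrary examples,
the delay value is in milliseconds, and this assumes the dummynet
module is loaded:)
ipfw pipe 1 config bw 100Mbit/s delay 16
ipfw add 100 pipe 1 ip from any to any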
See comments above.
5. The queue during writes at a speed of 40Mbit/s on the main host:
[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
SNDR's memory and disk queues are adjustable. The two commands are:
sndradm [opts] -F <maxqfbas> [set]    set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]   set maximum writes (items) to queue
These commands set the high-water marks for both the number of blocks
and the number of items in the memory queue. These are high-water
marks, not hard stops, so it is possible for SNDR to exceed these
values based on current in-progress I/Os.
maxqfbas 25000000
maxqitems 16384
I see that maxqfbas has been set, but you also need to increase
maxqitems.
Be forewarned that guessing a value could have serious memory
implications. The value of 25000000 is in blocks of 512 bytes. This
means that you are allowing SNDR to queue up approximately 12GB of
kernel memory to hold unreplicated data. You may have this much
uncommitted memory sitting idle, but if not, as SNDR tries to consume
this memory, it will impose some serious memory demands on your
system. Fortunately you didn't guess a maxqitems value too.
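As an illustration only (the values here are made-up examples, not
recommendations, and should be sized against the memory you can
afford to dedicate to the queue; -F and -W are the options listed
above, and -g is assumed to apply the change to every set in the
consistency group):
sndradm -g zfs-pool -F 2000000   # allow up to ~1GB (2000000 x 512-byte blocks) in the queue
sndradm -g zfs-pool -W 65536     # allow up to 65536 queued write entries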
As shown with either "dsstat -m sndr -d q", or with kstat data, these
values are monitored with runtime counters. The low value of maxqitems
is likely the item causing the slow performance issues below. Whenever
the replica hits either of these limits, back pressure (flow control)
is imposed on the application or data services writing to the SNDR
primary volumes.
async_block_hwm 1402834
async_item_hwm 16417
async_queue_blocks 1390128
async_queue_items 16382
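Note that async_queue_items at 16382 is already pressed against the
maxqitems limit of 16384, while async_queue_blocks is far below
maxqfbas, which is consistent with the queue being item-bound. A small
sketch of how to watch these counters while tuning (the egrep pattern
is just an example filter; the two commands themselves appear
elsewhere in this thread):
dsstat -m sndr -d q
kstat sndr:0:setinfo | egrep 'hwm|queue_|maxq'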
In addition to adjusting these memory queue limits to more reasonable
values, the next option is to increase the number of asynchronous
flusher threads from the default of 2. It is applications, filesystems
or databases that fill these SNDR memory or disk queues; the
asynchronous flusher threads empty them. Of course, more threads
emptying the queues increases SNDR's memory, network and CPU demands
to perform replication.
sndradm [opts] -A <asyncthr> [set]    set the number of asynchronous threads
NOTE: If replicating based on I/O consistency groups, increasing one
replica's thread count increases it across all members of the group.
The method for tuning this value is to keep doubling the current value
until there is no measured improvement in replication performance,
then decrease the last value by 1/2.
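A sketch of that procedure (the thread counts are arbitrary examples;
-A is the option shown above, and -g is assumed to apply it to the
whole group):
sndradm -g zfs-pool -A 4   # double the default of 2, measure replication throughput
sndradm -g zfs-pool -A 8   # double again if throughput improved
sndradm -g zfs-pool -A 4   # back off to the previous value once no further gain is seen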
If it is determined that replication demands still force memory queue
full conditions, the next option is to switch from memory to disk
queues. Of course, if on average the amount of change caused by write
I/Os exceeds the average network replication rate, then neither memory
queues nor disk queues will help.
At this point time-fixed replication is your next option. This is
where a full or incremental snapshot is taken at some interval, and
then just the changes between snapshots are replicated. This is
similar to what ZFS does with snapshot, send and receive, but with
SNDR it is done at the block level, whereas with ZFS it is at the
filesystem level.
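For comparison, a minimal sketch of that time-fixed approach on the
ZFS side (the snapshot names are made up; it assumes ssh access to
mtl2.flt2 and an existing zstor pool on the secondary, and -F may be
needed on later receives if the destination has been modified):
zfs snapshot zstor@rep-1
zfs send zstor@rep-1 | ssh mtl2.flt2 zfs receive -F zstor   # initial full copy
zfs snapshot zstor@rep-2
zfs send -i zstor@rep-1 zstor@rep-2 | ssh mtl2.flt2 zfs receive zstor   # only the changes since rep-1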
AVS is both SNDR (remote replication) and II (point in time copy, or
snapshots), and the two data services are integrated so that
incremental replication of snapshots is easy.
The problems:
The problems below are likely due to the too-low setting of maxqitems
at 16384, compounded by running with the default of two asynchronous
flusher threads.
- Jim
1.
In replication mode, data transmission on freebsd is only 2.5 Mbit/s
for rpc traffic, which is much lower than the numbers netio shows:
----------------- Real link, 100Mb; 16ms delay -------------
TCP connection established.
Packet size 1k bytes: 3239 KByte/s Tx, 5885 KByte/s Rx.
2.
When initial synchronization (sndradm -nu) happens, its traffic is
almost zero.
But if it's a gig connection on the switch, then the sync is pretty
fast, maybe dozens of seconds instead of minutes.
3.
But the real problem is that iscsi initiator writes stall because of
sndr replication. The Windows initiator just hangs and can only be
reset by deleting the target on the server.
And async_block_hwm is very high when there is writing to the pool,
and it gets stuck:
async_block_hwm 1402834
[email protected]:/export/home/roman/zfs# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 3.27K 2.80K
zstor 8.99G 10.9T 0 4 35 451K
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 499 0 61.9M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 21 0 2.67M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.37M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 25 0 3.24M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.27M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 26 0 3.37M
---------- ----- ----- ----- ----- ----- -----
zpool iostat shows some writes, but the initiator on the windows box
doesn't respond at this time.
And this is what kstat shows:
[email protected]:# kstat sndr:0:setinfo
module: sndr instance: 0
name: setinfo class: storedge
async_block_hwm 1402834
async_item_hwm 16417
async_queue_blocks 1390128
async_queue_items 16382
async_queue_type memory
async_throttle_delay 8135137
autosync 0
bitmap /dev/md/rdsk/bmp0
bitsset 2674
bmpflags 0
bmp_size 5713920
crtime 11243.852414994
disk_status 0
flags 2054
if_down 0
if_rpc_version 7
maxqfbas 25000000
maxqitems 16384
primary_host tor2.flt2
primary_vol /dev/rdsk/c3t0d0s0
secondary_host mtl2.flt2
secondary_vol /dev/rdsk/c3t0d0s0
snaptime 76822.27391977
syncflags 0
syncpos 2925489600
type_flag 5
volsize 2925489887
So, the question for me now is whether avs is able to handle burst
writes and replicate the changes over the slow and high-latency link
to the other side without affecting performance on the primary host.
And it should track changes during long periods when the link is down.
Is it still possible to use avs with 2 hosts containing 8x1.36Tb
volumes, where the max write speed is about 30-40 Mb/s and the circuit
is a 100Mb link with 15-20 ms latency? In async mode, obviously.
Another question: if the link was down for a long time and blocks
have changed a couple of times during the downtime - how does sndr
replicate the sequenced writes?
It doesn't.
SNDR has three modes of operation:
logging mode
(re)synchronization mode
replicating mode
In logging mode and replicating mode, SNDR keeps both the SNDR
primary and secondary volumes in write-consistent (sequenced) order.
During (re)synchronization mode, SNDR updates the secondary volume
in block (actually bitmap) order. There is only a single bit used
to track differences between the primary and secondary volumes, and
one bit is equal to 32KB.
If the replication environment, volume content and application
availability need to be protected against the fact that an SNDR
replica is not write-order consistent during (re)synchronization
mode, SNDR supports an option called ndr_ii, which takes an automatic
compact dependent snapshot of the write-order-consistent SNDR
secondary volume. In the unlikely case that (re)synchronization mode
fails and the SNDR primary volume is lost, the snapshot volume can be
restored on the SNDR secondary.
Does that mean that during (re)synchronization mode the volumes
shouldn't be mounted, because the write order is not consistent?
Roman
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss