Roman,
Thanks for your help!
Actually I have some more questions. I need to make a decision on the
replication mode for our storage: zfs send-receive, avs, or even a
Microsoft internal tool on the iscsi volumes with independent zfs
snapshots on both sides.
Initially avs seemed to me a good option, but I can't make it work on
a 100Mb link with 8x1.36Tb volumes.
It's strange that sndr sends the entire bitmap - what if it is for a
big replicated volume like 1.36Gb? That's more than 100000 blocks for
async replication.
There will be constant timeouts on an average 100Mb link in this case.
SNDR does not replicate the bitmap volume, just the bitmap itself.
There is one bit per 32KB of primary volume size, with 8 bits per
byte, and 512 bytes per block. The answer for 1.36GB is just 11.04
blocks, or 5.5KB.
Of course, looking at the example below, the math for replicating TBs
versus GBs is 1024 times larger.
1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) =
11161 blocks, which is the value reported below for non-disk-queue
replication with SNDR.
But dsbitmap shows 100441 blocks for async replication - am I missing
something?
Yes you did, the words "disk queue". When replicating with a disk
queue, there is an additional requirement of storing a 1-byte or
4-byte reference counter per bit. These reference counters are
separate from the actual bitmap, and are not exchanged between the
SNDR primary and secondary nodes.
Required bitmap volume size:
Sync replication: 11161 blocks
Async replication with memory queue: 11161 blocks
Async replication with disk queue: 100441 blocks
Async replication with disk queue and 32 bit refcount: 368281 blocks
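To show where the disk queue numbers come from, here is a
back-of-the-envelope check for the same 1.36 TB volume (ignoring a
small amount of rounding and header overhead):
1.36 TB * (1 bit / 32KB) = ~45.7 million bits in the bitmap
1-byte refcount per bit: ~45.7 MB / 512 bytes per block = ~89280
blocks, plus the 11161-block bitmap = ~100441 blocks
4-byte refcount per bit: ~183 MB / 512 bytes per block = ~357120
blocks, plus the 11161-block bitmap = ~368281 blocks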
Good. I kind of figured that this was the problem. What are your
SNDR primary volume sizes?
After the initial sync started to work (although it's a very slow
process and takes 10-15 minutes to complete), I have the following
situation:
Below you made reference to using 'ping' with a result of "64 bytes
from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be
interesting to know the results of "ping -s mtl2.flt2 8192", where
8192 is the chunk size for exchanging bitmaps.
The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7 KB/
sec. With 11161 blocks * 512 bytes per bitmap, it would take ~25
minutes to exchange a bitmap with the level of link latency and
constrained bandwidth you are testing with.
1. Storage (8x1.36Tb in one raidz2 pool):
[email protected]# sndradm -i
tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2 /dev/rdsk/
c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t1d0s0 /dev/md/rdsk/bmp1 mtl2.flt2 /dev/rdsk/
c3t1d0s0 /dev/md/rdsk/bmp1 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t2d0s0 /dev/md/rdsk/bmp2 mtl2.flt2 /dev/rdsk/
c3t2d0s0 /dev/md/rdsk/bmp2 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t3d0s0 /dev/md/rdsk/bmp3 mtl2.flt2 /dev/rdsk/
c3t3d0s0 /dev/md/rdsk/bmp3 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t4d0s0 /dev/md/rdsk/bmp4 mtl2.flt2 /dev/rdsk/
c3t4d0s0 /dev/md/rdsk/bmp4 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t5d0s0 /dev/md/rdsk/bmp5 mtl2.flt2 /dev/rdsk/
c3t5d0s0 /dev/md/rdsk/bmp5 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t6d0s0 /dev/md/rdsk/bmp6 mtl2.flt2 /dev/rdsk/
c3t6d0s0 /dev/md/rdsk/bmp6 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2 /dev/rdsk/
c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2 /dev/rdsk/
c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
When you state the "initial sync" takes 10-15 minutes to complete,
what did you do to measure this 10-15 minutes?
Do you know that when using I/O consistency groups, one can also
manage all the replicas in a group with a single "-g" group command
like:
sndradm -g zfs-pool -nu
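For example (these are only illustrative invocations of the same "-g"
form; -l, -m, -u and -w are the usual sndradm operations, applied here
to every set in the group):
sndradm -g zfs-pool -nl    # place all sets in the group into logging mode
sndradm -g zfs-pool -nm    # start a full synchronization of the group
sndradm -g zfs-pool -nu    # start an update (re)synchronization of the group
sndradm -g zfs-pool -nw    # wait for the group's synchronization to complete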
2. Bitmaps are on a mirrored metadevice; they are bigger than you
mentioned, but this is what dsbitmap shows for the volumes:
Not an issue, SNDR ignores what it does not need.
bmp0: Soft Partition
Device: d100
State: Okay
Size: 100441 blocks (49 MB)
Extent Start Block Block count
0 34 100441
3. Network:
tor2.flt2 <---> freebsd router <---> mtl2.flt2
4. Latency:
[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms
I'm emulating on FreeBSD the actual delay and speed of the real
circuit, which is 100Mb and 16ms.
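(For reference, one common way to emulate this on FreeBSD is with
ipfw/dummynet; the pipe and rule numbers below are arbitrary examples,
the delay value is in milliseconds, and this assumes the dummynet
module is loaded:)
ipfw pipe 1 config bw 100Mbit/s delay 16
ipfw add 100 pipe 1 ip from any to any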
See comments above.
5. The queue during writes at a speed of 40Mbit/s on the main host:
[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
async_block_hwm 1402834
SNDR's memory and disk queues are adjustable. The two commands are:
sndradm [opts] -F <maxqfbas> [set]    set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]   set maximum writes (items) to queue
These commands set the high-water marks for both the number of blocks
and the number of items in the memory queue. These are high-water
marks, not hard stops, so it is possible for SNDR to exceed these
values based on current in-progress I/Os.
maxqfbas 25000000
maxqitems 16384
I see that maxqfbas has been set, but you also need to increase
maxqitems.
Be forewarned that guessing a value could have serious memory
implications. The value of 25000000 is in blocks of 512 bytes. This
means that you are allowing SNDR to queue up approximately 12GB of
kernel memory to hold unreplicated data. You may have this much
uncommitted memory sitting idle, but if not, as SNDR tries to consume
this memory, it will impose some serious memory demands on your
system. Fortunately you didn't guess a maxqitems value too.
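As an illustration only (the values here are made-up examples, not
recommendations, and should be sized against the memory you can
afford to dedicate to the queue; -F and -W are the options listed
above, and -g is assumed to apply the change to every set in the
consistency group):
sndradm -g zfs-pool -F 2000000   # allow up to ~1GB (2000000 x 512-byte blocks) in the queue
sndradm -g zfs-pool -W 65536     # allow up to 65536 queued write entries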
As shown with either "dsstat -m sndr -d q", or with kstat data, these
values are monitored with runtime counters. The low value of maxqitems
is likely the item causing the slow performance issues below. Whenever
the replica hits either of these limits, back pressure (flow control)
is imposed on the application or data services writing to the SNDR
primary volumes.
async_block_hwm 1402834
async_item_hwm 16417
async_queue_blocks 1390128
async_queue_items 16382
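Note that async_queue_items at 16382 is already pressed against the
maxqitems limit of 16384, while async_queue_blocks is far below
maxqfbas, which is consistent with the queue being item-bound. A small
sketch of how to watch these counters while tuning (the egrep pattern
is just an example filter; the two commands themselves appear
elsewhere in this thread):
dsstat -m sndr -d q
kstat sndr:0:setinfo | egrep 'hwm|queue_|maxq'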
In addition to adjusting these memory queue limits to more reasonable
values, the next option is to increase the number of asynchronous
flusher threads from the default of 2. It is applications, filesystems
or databases that fill these SNDR memory or disk queues; the
asynchronous flusher threads empty them. Of course, more threads
emptying the queues increases SNDR's memory, network and CPU demands
to perform replication.
sndradm [opts] -A <asyncthr> [set]    set the number of asynchronous threads
NOTE: If replicating based on I/O consistency groups, increasing one
replica's thread count increases it across all members of the group.
The method for tuning this value is to keep doubling the current value
until there is no measured improvement in replication performance,
then decrease the last value by 1/2.
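A sketch of that procedure (the thread counts are arbitrary examples;
-A is the option shown above, and -g is assumed to apply it to the
whole group):
sndradm -g zfs-pool -A 4   # double the default of 2, measure replication throughput
sndradm -g zfs-pool -A 8   # double again if throughput improved
sndradm -g zfs-pool -A 4   # back off to the previous value once no further gain is seen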
If it is determined that replication demands still force memory queue
full conditions, the next option is to switch from memory to disk
queues. Of course, if on average the amount of change caused by write
I/Os exceeds the average network replication rate, then neither memory
queues nor disk queues will help.
At this point time-fixed replication is your next option. This is
where a full or incremental snapshot is taken at some interval, and
then just the changes between snapshots are replicated. This is
similar to what ZFS does with snapshot, send and receive, but with
SNDR it is done at the block level, whereas with ZFS it is at the
filesystem level.
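For comparison, a minimal sketch of that time-fixed approach on the
ZFS side (the snapshot names are made up; it assumes ssh access to
mtl2.flt2 and an existing zstor pool on the secondary, and -F may be
needed on later receives if the destination has been modified):
zfs snapshot zstor@rep-1
zfs send zstor@rep-1 | ssh mtl2.flt2 zfs receive -F zstor   # initial full copy
zfs snapshot zstor@rep-2
zfs send -i zstor@rep-1 zstor@rep-2 | ssh mtl2.flt2 zfs receive zstor   # only the changes since rep-1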
AVS is both SNDR (remote replication) and II (point in time copy, or
snapshots), and the two data services are integrated so that
incremental replication of snapshots is easy.
The problems:
The problems below are likely due to the too-low setting of maxqitems
at 16384, compounded by running with the default of two asynchronous
flusher threads.
- Jim
1.
In replication mode, data transmission on freebsd is only 2.5 Mbit/s
for rpc traffic, which is much lower than the numbers netio shows:
----------------- Real link, 100Mb; 16ms delay -------------
TCP connection established.
Packet size 1k bytes: 3239 KByte/s Tx, 5885 KByte/s Rx.
2.
When initial synchronization (sndradm -nu) happens, its traffic is
almost zero.
But if it's a gig connection on the switch, then the sync is pretty
fast, maybe dozens of seconds instead of minutes.
3.
But the real problem is that iscsi initiator writes stall because of
sndr replication. The Windows initiator just hangs and can only be
reset by deleting the target on the server.
And async_block_hwm is very high when there is writing to the pool,
and it gets stuck:
async_block_hwm 1402834
[email protected]:/export/home/roman/zfs# zpool iostat 5
capacity operations bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 3.27K 2.80K
zstor 8.99G 10.9T 0 4 35 451K
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 499 0 61.9M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 21 0 2.67M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.37M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 25 0 3.24M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 18 0 2.27M
---------- ----- ----- ----- ----- ----- -----
rpool 10.6G 63.4G 0 0 0 0
zstor 8.99G 10.9T 0 26 0 3.37M
---------- ----- ----- ----- ----- ----- -----
zpool iostat shows some writes, but the initiator on the windows box
doesn't respond at this time.
And this is what kstat shows:
[email protected]:# kstat sndr:0:setinfo
module: sndr instance: 0
name: setinfo class: storedge
async_block_hwm 1402834
async_item_hwm 16417
async_queue_blocks 1390128
async_queue_items 16382
async_queue_type memory
async_throttle_delay 8135137
autosync 0
bitmap /dev/md/rdsk/bmp0
bitsset 2674
bmpflags 0
bmp_size 5713920
crtime 11243.852414994
disk_status 0
flags 2054
if_down 0
if_rpc_version 7
maxqfbas 25000000
maxqitems 16384
primary_host tor2.flt2
primary_vol /dev/rdsk/c3t0d0s0
secondary_host mtl2.flt2
secondary_vol /dev/rdsk/c3t0d0s0
snaptime 76822.27391977
syncflags 0
syncpos 2925489600
type_flag 5
volsize 2925489887
So, the question for me now is whether avs is able to handle burst
writes and replicate the changes over the slow and high-latency link
to the other side without affecting performance on the primary host.
And it should track changes during long periods when the link is down.
Is it still possible to use avs with 2 hosts containing 8x1.36Tb
volumes, where the max write speed is about 30-40 Mb/s and the circuit
is a 100Mb link with 15-20 ms latency? In async mode, obviously.
Another question: if the link was down for a long time and blocks
have changed a couple of times during the downtime - how does sndr
replicate the sequenced writes?
It doesn't.
SNDR has three modes of operation:
logging mode
(re)synchronization mode
replicating mode
In logging mode and replicating mode, SNDR keeps both the SNDR
primary and secondary volumes in write-consistent (sequenced) order.
During (re)synchronization mode, SNDR updates the secondary volume
in block (actually bitmap) order. There is only a single bit used
to track differences between the primary and secondary volumes, and
one bit is equal to 32KB.
If the replication environment, volume content and application
availability need to be protected against the fact that an SNDR
replica is not write-order consistent during (re)synchronization
mode, SNDR supports an option called ndr_ii, which takes an automatic
compact dependent snapshot of the write-order-consistent SNDR
secondary volume. In the unlikely case that (re)synchronization mode
fails and the SNDR primary volume is lost, the snapshot volume can be
restored on the SNDR secondary.
Does that mean that during (re)synchronization mode the volumes
shouldn't be mounted, because the write order is not consistent?
Roman
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss