Jim Dunham wrote, On 04/24/2009 06:46 PM:
Roman,
Thanks for your help!
Actually I have some more questions. I need to make a decision on the
replication mode for our storage: zfs send/receive, AVS, or even a
Microsoft internal tool on the iSCSI volumes with independent ZFS
snapshots on both sides.
Initially AVS seemed like a good option, but I can't make it
work on a 100Mb link with 8x1.36TB volumes.
It's strange that SNDR sends the entire bitmap - what if it is for a
big replicated volume like 1.36Gb? That's more than 100000
blocks for async replication.
There will be constant timeouts on an average 100M link in this case.
SNDR does not replicate the bitmap volume, just the bitmap itself.
There is one bit per 32KB of primary volume size, with 8 bits per
byte, and 512 bytes per block. The answer for 1.36GB is just 11.04
blocks, or 5.5KB.
Of course, looking at the example below, the math for replicating TBs
versus GBs is 1024 times larger.
1.36 TB * (1 bit / 32KB) * (1 byte / 8 bits) * (1 block / 512 bytes) =
11161 blocks, which is the value reported below for non-disk-queue
replication with SNDR.
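Folding those constants together, one 512-byte bitmap block covers 32KB * 8 * 512 = 128MB of primary volume, so the bitmap size is simply the volume size divided by 2^27. A quick sanity check in shell (the 1 TiB volume is just an illustrative size, and this ignores the small metadata header SNDR keeps at the front of the bitmap):

```sh
# 1 bit of bitmap per 32KB of volume, 8 bits per byte, 512 bytes per block:
# each bitmap block therefore covers 32768 * 8 * 512 = 2^27 bytes of volume.
bytes_per_bitmap_block=$((32768 * 8 * 512))   # 134217728

vol_bytes=$((1024 * 1024 * 1024 * 1024))      # example: a 1 TiB volume
bitmap_blocks=$(( (vol_bytes + bytes_per_bitmap_block - 1) / bytes_per_bitmap_block ))
echo "$bitmap_blocks"                          # 8192 blocks (4 MB of bitmap)
```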
But dsbitmap shows 100441 blocks for async replication - am I
missing something?
Yes you did, the words "disk queue". When replicating with a disk
queue, there is an additional requirement of storing a 1-byte or 4-byte
reference counter per bit. These reference counters are separate from
the actual bitmap, and are not exchanged between the SNDR primary and
secondary nodes.
Required bitmap volume size:
Sync replication: 11161 blocks
Async replication with memory queue: 11161 blocks
Async replication with disk queue: 100441 blocks
Async replication with disk queue and 32 bit refcount: 368281 blocks
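These four figures are internally consistent if one assumes (my assumption, not stated in the thread) that the reported 11161 blocks are one metadata block plus 11160 blocks of bitmap proper, and that the per-bit reference counters add 8 (or, for 32-bit counters, 32) extra blocks per bitmap block. Under that assumption the arithmetic reproduces dsbitmap's numbers exactly:

```sh
# Assumed layout: 1 metadata block + pure bitmap blocks (11161 = 1 + 11160).
meta=1
bitmap=11160

# Each bitmap byte holds 8 bits; a 1-byte refcount per bit therefore needs
# 8 bytes of counters per bitmap byte, i.e. 8 counter blocks per bitmap block
# (32 counter blocks per bitmap block with 32-bit refcounts).
sync_q=$((meta + bitmap))                 # sync / memory-queue async
disk_q8=$((meta + bitmap + 8 * bitmap))   # disk queue, 1-byte refcount
disk_q32=$((meta + bitmap + 32 * bitmap)) # disk queue, 32-bit refcount

echo "$sync_q $disk_q8 $disk_q32"         # 11161 100441 368281
```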
Good. I kind of figured that this was the problem. What is your
SNDR primary volume size?
After initial sync started to work (although it's a very slow process
and it takes 10-15 mins to complete) I have the following situation:
Below you made reference to using 'ping' with a result of "64 bytes
from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms". It would be
interesting to know the results of "ping -s mtl2.flt2 8192", where
8192 is the chunk size for exchanging bitmaps.
It's the same with this size on both sides.
The reason I mention this is that "64 bytes / 16.822 ms" is ~3.7
KB/sec. With 8 * 11161 blocks * 512 bytes, it would take ~25 minutes
to exchange bitmaps at the level of link latency and constrained
bandwidth you are testing with.
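For scale, the raw bitmap payload for the eight sets counted above works out as follows (how long the exchange actually takes depends on the effective throughput the link sustains for 8KB chunks, not just the ping time):

```sh
# 8 replicated sets, 11161 bitmap blocks each, 512 bytes per block.
sets=8
blocks=11161
payload=$((sets * blocks * 512))
echo "$payload"      # 45715456 bytes, roughly 43.6 MiB of bitmap to exchange
```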
It should run at least at the same speed as the replication itself:
2.5-3 Mbit/s.
But definitely the latency causes the problem; with low latency it
initializes very quickly.
1. Storage (8x1.36Tb in one raidz2 pool)
[email protected]# sndradm -i
tor2.flt2 /dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 mtl2.flt2
/dev/rdsk/c3t0d0s0 /dev/md/rdsk/bmp0 ip async g zfs-pool
.................
tor2.flt2 /dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 mtl2.flt2
/dev/rdsk/c3t7d0s0 /dev/md/rdsk/bmp7 ip async g zfs-pool
tor2.flt2 /dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 mtl2.flt2
/dev/rdsk/c4t1d0s0 /dev/md/rdsk/bmp8 ip async g zfs-pool
When you state the "initial sync" takes 10-15 minutes to complete,
what did you do to measure this 10-15 minutes?
Do you know that when using I/O consistency groups, one can also manage
all the replicas in a group with a single "-g" group command like:
sndradm -g zfs-pool -nu
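For reference, a few more group-wide invocations in the same spirit (a sketch, not run here; check sndradm(1M) and dsstat(1M) on your release for the exact options):

```sh
sndradm -g zfs-pool -P      # print the status of every set in the group
sndradm -g zfs-pool -nu     # update (re)sync all sets, without prompting
sndradm -g zfs-pool -w      # wait for the group's sync to complete
dsstat -m sndr 5            # watch replication progress every 5 seconds
```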
Yes, I used I/O group replication everywhere.
2. Bitmaps are on the mirrored metadevice, they are bigger than you
mentioned but this is what dsbitmap shows for volumes:
Not an issue, SNDR ignores what it does not need.
bmp0: Soft Partition
Device: d100
State: Okay
Size: 100441 blocks (49 MB)
Extent          Start Block     Block count
     0                   34          100441
3. Network:
tor2.flt2 <--> freebsd router <--> mtl2.flt2
4. Latency:
[email protected]:# ping -s mtl2.flt2
PING 172.0.5.10: 56 data bytes
64 bytes from mtl2.flt2 (172.0.5.10): icmp_seq=0. time=16.822 ms
I'm emulating on FreeBSD the actual delay and speed of the real
circuit, which is 100Mb and 16ms.
See comments above.
5. The queue during writes at 40Mbit/s on the main host:
[email protected]:/# kstat sndr::setinfo | grep async_block_hwm
async_block_hwm         1402834
.................
async_block_hwm         1402834
SNDR's memory and disk queues are adjustable. The two commands are:
sndradm [opts] -F <maxqfbas> [set]    set maximum fbas (blocks) to queue
sndradm [opts] -W <maxwrites> [set]   set maximum writes (items) to queue
These commands set the high-water marks for both the number of blocks
and the number of items in the memory queue. These are high-water
marks, not hard stops, so it is possible for SNDR to exceed these
values based on current in-progress I/Os.
maxqfbas        25000000
maxqitems       16384
I see that maxqfbas has been set, but you also need to increase maxqitems.
Be forewarned that guessing a value could have serious memory
implications. The value of 25000000 is in 512-byte blocks. This
means that you are allowing SNDR to queue up approximately 12GB of
kernel memory to hold unreplicated data. You may have this much
uncommitted memory sitting idle, but if not, as SNDR tries to consume
this memory it will impose some serious memory demands on your
system. Fortunately you didn't guess a maxqitems value too.
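The ~12GB figure falls straight out of the block count:

```sh
# maxqfbas is counted in 512-byte blocks (fbas).
maxqfbas=25000000
queue_bytes=$((maxqfbas * 512))
echo "$queue_bytes"    # 12800000000 bytes, i.e. ~12GB of kernel memory
```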
I managed to hang the server with a too-big queue. It didn't respond
to any queries or key presses, but when SNDR dropped into logging mode
after I unplugged the network, it returned to life :)
As shown with either "dsstat -m sndr -d q", or with kstat data, these
values are monitored with runtime counters. The low value of maxqitems
is likely the item causing the slow performance issues below. Whenever
the replica hits either of these limits, back pressure (flow control)
is imposed on the application or data services writing to the SNDR
primary volumes.
async_block_hwm         1402834
async_item_hwm          16417
async_queue_blocks      1390128
async_queue_items       16382
In addition to adjusting these memory queue limits to more reasonable
values, the next option is to increase the number of asynchronous
flusher threads from the default of 2. Applications, filesystems and
databases fill these SNDR memory or disk queues; the asynchronous
flusher threads empty them. Of course, the more threads emptying the
queues, the greater SNDR's memory, network and CPU demands to perform
replication.
sndradm [opts] -A <asyncthr> [set] set the number of
asynchronous threads
NOTE: If replicating based on I/O consistency groups, when increasing
one replica's thread count, increase it across all members of the group.
The method for tuning this value is to keep doubling the current value
until there is no measured improvement in replication performance,
then halve the last value.
If it is determined that replication demands still force queue-full
conditions, the next option is to switch from memory to disk
queues. Of course, if on average the amount of change caused by write
I/Os exceeds the average network replication rate, then neither memory
queues nor disk queues will help.
That's what I figured out today. The queue, no matter how big, filled
up and the initiators simply stopped working (Windows completely;
Linux continued to work after the queue drained).
I played with the mentioned options, but the problem was definitely
not related to queue tuning; I just need another type of replication.
The speed increased a little when I changed the number of threads to
4, but the queue still filled up. On the 100Mbit link I got 25Mbit/s
synchronization speed (replication mode).
At this point, time-fixed replication is your next option. This is
where a full or incremental snapshot is taken at some interval, and
then just the changes between snapshots are replicated. This is
similar to what ZFS does with snapshot send and receive, but with
SNDR it is at the block level, whereas with ZFS it's at the filesystem
level.
AVS is both SNDR (remote replication) and II (point-in-time copy, or
snapshots), and the two data services are integrated so that
incremental replication of snapshots is easy.
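A rough shape of that SNDR + II combination (a sketch with made-up shadow/bitmap device paths, not taken from this thread; see iiadm(1M) and sndradm(1M) on your release for the supported way to tie the two services together):

```sh
# On the primary: take an independent point-in-time copy of the volume
# (master -> shadow, with changes tracked in II's own bitmap).
iiadm -e ind /dev/rdsk/c3t0d0s0 /dev/rdsk/shadow0 /dev/rdsk/iibmp0

# Replicate the shadow with SNDR; at each interval, refresh the snapshot
# and let SNDR ship only the blocks that changed between snapshots.
iiadm -u s /dev/rdsk/shadow0     # update the shadow to the current master
sndradm -nu                      # update sync: send only the changed blocks
```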
Well, I'll try this II in the near future and post the results here,
in case somebody else is interested.
And I wanted to ask you whether there is other software from Sun that
can be used for such a task - replication over high-latency links?
Roman
_______________________________________________
storage-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/storage-discuss