Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-20 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 Sent: Saturday, January 19, 2013 5:39 PM
 
 the space allocation more closely resembles a variant
 of mirroring,
 like some vendors call RAID-1E

Awesome, thank you.   :-)



Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.

Oh, I forgot to mention - the above logic only makes sense for mirrors and 
stripes, not for raidz (or raid-5/6/dp in general).

If you have a pool of mirrors or stripes, the system isn't forced to subdivide 
a 4k block onto multiple disks, so it works very well.  But if you have a 
volblocksize of 4k and, let's say, a 5-disk raidz (capacity of 4 disks), then the 4k 
block gets divided into 1k on each data disk plus 1k of parity on the parity disk.  
Now, since the hardware only supports block sizes of 4k ... you can see there's a 
lot of wasted space, and if you do a lot of such I/O you'll also have a lot of 
wasted time waiting for seeks/latency.



Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Richard Elling
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Oh, I forgot to mention - The above logic only makes sense for mirrors and 
 stripes.  Not for raidz (or raid-5/6/dp in general)
 
 If you have a pool of mirrors or stripes, the system isn't forced to 
 subdivide a 4k block onto multiple disks, so it works very well.  But if you 
 have a pool blocksize of 4k and let's say a 5-disk raidz (capacity of 4 
 disks) then the 4k block gets divided into 1k on each disk and 1k parity on 
 the parity disk.  Now, since the hardware only supports block sizes of 4k ... 
 You can see there's a lot of wasted space, and if you do a bunch of it, 
 you'll also have a lot of wasted time waiting for seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised 
of 4k sector disks, then
there will be one data and one parity block. There will not be 4 data + 1 
parity with 75% 
space wastage. Rather, the space allocation more closely resembles a variant of 
mirroring,
like some vendors call RAID-1E
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-19 Thread Jim Klimov

On 2013-01-19 23:39, Richard Elling wrote:

This is not quite true for raidz. If there is a 4k write to a raidz
comprised of 4k sector disks, then
there will be one data and one parity block. There will not be 4 data +
1 parity with 75%
space wastage. Rather, the space allocation more closely resembles a
variant of mirroring,
like some vendors call RAID-1E


I agree with this exact reply, but as I posted sometime late last year,
reporting on my digging in the bowels of ZFS and my problematic pool:
for a 6-disk raidz2 set I only saw allocations (including the two parity
disks) divisible by 3 sectors, even when the amount of (compressed)
userdata was not rounded that way. I.e. I had either miniature files, or
tails of files, fitting into one sector plus two parities (a 3-sector
allocation overall), or tails ranging 2-4 sectors and occupying 6 sectors
with parity (while 2 or 3 sectors could have used just 4 or 5 with
parities, respectively).

I am not sure how to interpret that factor of 3 - one userdata sector
plus both parities, or half of a 6-disk stripe - both explanations fit
my case.
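
For illustration, here is a minimal shell sketch of that rounding, assuming
raidz pads every allocation up to a multiple of (nparity + 1) sectors - which
would produce exactly the divisible-by-3 pattern above (the geometry and
numbers are hypothetical):

  # raidz allocation rounding sketch (hypothetical geometry, 4k sectors)
  nparity=2                      # raidz2
  for data in 1 2 3 4; do        # userdata sectors in a block
      used=$((data + nparity))   # sectors actually written (data + parity)
      alloc=$(( (used + nparity) / (nparity + 1) * (nparity + 1) ))  # round up
      echo "$data data sector(s): $used written, $alloc allocated"
  done

With nparity=1 the same rounding gives one data plus one parity sector for a
single 4k block - the RAID-1E-like layout Richard describes.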

But yes, with the current raidz allocation there are many ways to waste
space, and those small (or not so small) percentages do add up.
Rectifying this, i.e. allocating only as much as is actually used,
does not seem like an incompatible on-disk format change, and should
be doable within the write-queue logic. Maybe it would cause trade-offs
in efficiency; however, ZFS already explicitly rotates the starting disks
of allocations every few megabytes in order to even out the load
among spindles (normally parity disks don't have to be accessed,
unless mismatches occur on the data disks). Dropping such padding would
only help that load-balancing goal, and save space at the same time...

My 2c,
//Jim



Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Jim Klimov

On 2013-01-18 06:35, Thomas Nau wrote:

If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 
4K?  This seems like the most obvious improvement.


4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)


My only pain so far is the time a send/receive takes without really loading the
network at all. VM performance is nothing I worry about at all as it's pretty 
good.
So key question for me is if going from 8k to 16k or even 32k would have some 
benefit for
that problem?


I would guess that increasing the block size would, on one hand, improve
your reads - more userdata would be stored contiguously as part of one
ZFS block - so sending the backup streams should be more about reading
and sending data and less about random seeking.

On the other hand, this would likely be paid for with more
read-modify-writes (when larger ZFS blocks are partially updated by
the smaller clusters of the VM's filesystem) while the overall system
is running and used for its primary purpose. However, since the guest
FS is likely to store files of non-minimal size, the whole larger
backend block would probably be updated anyway...

So, I think, this is something an experiment can show you - whether the
gain during backup (and primary-job) reads outweighs any degradation
during the primary-job writes.

As for the experiment, I guess you can always make a ZVOL with a different
volblocksize, dd data into it from a clone of the production dataset's
snapshot, and attach the VM (or its clone) to the newly created clone of
its disk image.
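
For example, a rough sketch of such an experiment (dataset, zvol and snapshot
names below are made up; on Solaris/illumos the raw zvol devices live under
/dev/zvol/rdsk):

  # create a test zvol with a larger volblocksize
  zfs create -s -V 200G -o volblocksize=32k tank/vmtest
  # clone a snapshot of the production zvol and copy its contents over
  zfs snapshot tank/vmdisk@copysrc
  zfs clone tank/vmdisk@copysrc tank/vmdisk-copy
  dd if=/dev/zvol/rdsk/tank/vmdisk-copy of=/dev/zvol/rdsk/tank/vmtest bs=1M
  # then attach a cloned VM to tank/vmtest over iSCSI and compare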

Good luck, and I hope I got Richard's logic right in that answer ;)
//Jim



Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Richard Elling

On Jan 17, 2013, at 9:35 PM, Thomas Nau thomas@uni-ulm.de wrote:

 Thanks for all the answers (more inline)
 
 On 01/18/2013 02:42 AM, Richard Elling wrote:
 On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 mailto:bfrie...@simple.dallas.tx.us wrote:
 
 On Wed, 16 Jan 2013, Thomas Nau wrote:
 
 Dear all
 I've a question concerning possible performance tuning for both iSCSI 
 access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC 
 RAM ZIL
 SSDs and 128G of main memory
 
 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some 
 cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)
 
 My only pain so far is the time a send/receive takes without really loading 
 the
 network at all. VM performance is nothing I worry about at all as it's pretty 
 good.
 So key question for me is if going from 8k to 16k or even 32k would have some 
 benefit for
 that problem?

send/receive can bottleneck on the receiving side. Take a look at the archives
searching for mbuffer as a method of buffering on the receive side. In a well
tuned system, the send will be from ARC :-)
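
A minimal sketch of such a pipeline (pool, snapshot and host names are
placeholders; the buffer sizes are just a starting point):

  zfs send -i tank/vmdisk@yesterday tank/vmdisk@today | \
      mbuffer -s 128k -m 1G | \
      ssh backuphost 'mbuffer -s 128k -m 1G | zfs receive -F backup/vmdisk'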
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-18 Thread Richard Elling
On Jan 18, 2013, at 4:40 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-18 06:35, Thomas Nau wrote:
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some 
 cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)
 
 My only pain so far is the time a send/receive takes without really loading 
 the
 network at all. VM performance is nothing I worry about at all as it's 
 pretty good.
 So key question for me is if going from 8k to 16k or even 32k would have 
 some benefit for
 that problem?
 
 I would guess that increasing the block size would on one hand improve
 your reads - due to more userdata being stored contiguously as part of
 one ZFS block - and thus sending of the backup streams should be more
 about reading and sending the data and less about random seeking.

There is too much caching in the datapath to make a broad statement stick.
Empirical measurements with your workload will be needed to pick the winner.

 On the other hand, this may likely be paid off with the need to do more
 read-modify-writes (when larger ZFS blocks are partially updated with
 the smaller clusters in the VM's filesystem) while the overall system
 is running and used for its primary purpose. However, since the guest
 FS is likely to store files of non-minimal size, it is likely that the
 whole larger backend block would be updated anyway...

For many ZFS implementations, RMW for zvols is the norm.

 
 So, I think, this is something an experiment can show you - whether the
 gain during backup (and primary-job) reads vs. possible degradation
 during the primary-job writes would be worth it.
 
 As for the experiment, I guess you can always make a ZVOL with different
 recordsize, DD data into it from the production dataset's snapshot, and
 attach the VM or its clone to the newly created clone of its disk image.

In my experience, it is very hard to recreate in the lab the environments
found in real life. dd, in particular, will skew the results a bit because it
is in LBA order for zvols, not the creation order as seen in the real world.

That said, trying to get high performance out of HDDs is an exercise like
fighting the tides :-)
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Bob Friesenhahn

On Wed, 16 Jan 2013, Thomas Nau wrote:


Dear all
I've a question concerning possible performance tuning for both iSCSI access
and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
SSDs and 128G of main memory

The iSCSI access pattern (1 hour daytime average) looks like the following
(Thanks to Richard Elling for the dtrace script)


If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
volblocksize of 4K?  This seems like the most obvious improvement.
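
Note that volblocksize can only be set at creation time, so this means 
creating a new zvol and migrating the data - something like (name and size 
are hypothetical):

  zfs create -s -V 500G -o volblocksize=4k tank/xenvol-4k
  zfs get volblocksize tank/xenvol-4k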


[ stuff removed ]


For disaster recovery we plan to sync the pool as often as possible
to a remote location. Running send/receive after a day or so seems to take
a significant amount of time wading through all the blocks and we hardly
see network average traffic going over 45MB/s (almost idle 1G link).
So here's the question: would increasing/decreasing the volblocksize improve
the send/receive operation and what influence might show for the iSCSI side?


Matching the volume block size to what the clients are actually using 
(due to their filesystem configuration) should improve performance 
during normal operations, and should reduce the number of blocks which 
need to be sent in the backup by reducing write amplification due to 
overlapping blocks.
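
A quick way to check what the clients actually use, assuming typical Windows 
and Linux guests (run inside the guests; drive/device names are examples):

  # Windows guest (cmd.exe): reports "Bytes Per Cluster"
  fsutil fsinfo ntfsinfo C:
  # Linux guest with ext3/ext4: reports "Block size"
  tune2fs -l /dev/sda1 | grep "Block size"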


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Jim Klimov

On 2013-01-17 16:04, Bob Friesenhahn wrote:

If almost all of the I/Os are 4K, maybe your ZVOLs should use a
volblocksize of 4K?  This seems like the most obvious improvement.



Matching the volume block size to what the clients are actually using
(due to their filesystem configuration) should improve performance
during normal operations and should reduce the number of blocks which
need to be sent in the backup by reducing write amplification due to
overlap blocks..



Also, while you are at it, it would make sense to verify that the
clients (i.e. the VMs' filesystems) do their I/Os 4KB-aligned, i.e. that
their partitions start at a 512b-sector offset divisible by 8 inside
the virtual HDDs, and that the FS headers also align so the first
cluster is 4KB-aligned.
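
One way to check this from inside a guest (device names are examples only) -
partition starts reported in 512b sectors should be divisible by 8:

  parted /dev/sdb unit s print
  # or, on an older Linux guest:
  fdisk -l -u /dev/sdb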

Classic MSDOS MBR did not guarantee such alignment of the partition
start, since it used 63 sectors as the cylinder size and offset factor.
Newer OSes don't use the classic layout, as any configuration is
allowed; and GPT is well aligned as well.

Overall, a single IO in the VM guest changing a 4KB cluster in its
FS should translate to one 4KB IO in your backend storage changing
the dataset's userdata (without reading a bigger block and modifying
it with COW), plus some avalanche of metadata updates (likely with
the COW) for ZFS's own bookkeeping.

//Jim



Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Wed, 16 Jan 2013, Thomas Nau wrote:
 
 Dear all
 I've a question concerning possible performance tuning for both iSCSI access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM 
 ZIL
 SSDs and 128G of main memory
 
 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize 
 of 4K?  This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)

 
 [ stuff removed ]
 
 For disaster recovery we plan to sync the pool as often as possible
 to a remote location. Running send/receive after a day or so seems to take
 a significant amount of time wading through all the blocks and we hardly
 see network average traffic going over 45MB/s (almost idle 1G link).
 So here's the question: would increasing/decreasing the volblocksize improve
 the send/receive operation and what influence might show for the iSCSI side?
 
 Matching the volume block size to what the clients are actually using (due to 
 their filesystem configuration) should improve performance during normal 
 operations and should reduce the number of blocks which need to be sent in 
 the backup by reducing write amplification due to overlap blocks..

compression is a good win, too 
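
For example (dataset name is made up; use lz4 only if your release supports
it, otherwise lzjb):

  zfs set compression=lz4 tank/vmdisk
  zfs get compression,compressratio tank/vmdisk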
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling

On Jan 17, 2013, at 8:35 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-17 16:04, Bob Friesenhahn wrote:
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Matching the volume block size to what the clients are actually using
 (due to their filesystem configuration) should improve performance
 during normal operations and should reduce the number of blocks which
 need to be sent in the backup by reducing write amplification due to
 overlap blocks..
 
 
 Also, it would make sense while you are at it to verify that the
 clients(i.e. VMs' filesystems) do their IOs 4KB-aligned, i.e. that
 their partitions start at a 512b-based sector offset divisible by
 8 inside the virtual HDDs, and the FS headers also align to that
 so the first cluster is 4KB-aligned.

This is the classical expectation. So I added an alignment check into
nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from
ESX and, believe it or not, alignment doesn't matter at all, at least for 
the data I've collected. I'll let NetApp wallow in the mire of misalignment
while I blissfully dream of other things :-)

 Classic MSDOS MBR did not warrant that partition start, by using
 63 sectors as the cylinder size and offset factor. Newer OSes don't
 use the classic layout, as any config is allowable; and GPT is well
 aligned as well.
 
 Overall, a single IO in the VM guest changing a 4KB cluster in its
 FS should translate to one 4KB IO in your backend storage changing
 the dataset's userdata (without reading a bigger block and modifying
 it with COW), plus some avalanche of metadata updates (likely with
 the COW) for ZFS's own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload
on the wire. To wit, I did a bunch of VDI and VDI-like (small, random
writes) testing on XenServer and while the clients were chugging
away doing 4K random I/Os, on the wire I was seeing 1MB NFS
writes. In part this analysis led to my cars-and-trains analysis.

In some VMware configurations, over the wire you could see a 16k
read for every 4k random write. Go figure. Fortunately, those 16k 
reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of
what is really happening over the wire.
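
For instance, alongside iscsisvrtop, a generic dtrace one-liner for the
backend block-device side (not the wire) would be something like:

  dtrace -n 'io:::start { @["disk I/O size (bytes)"] = quantize(args[0]->b_bcount); }'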
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Thomas Nau
Thanks for all the answers (more inline)

On 01/18/2013 02:42 AM, Richard Elling wrote:
 On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 mailto:bfrie...@simple.dallas.tx.us wrote:
 
 On Wed, 16 Jan 2013, Thomas Nau wrote:

 Dear all
 I've a question concerning possible performance tuning for both iSCSI access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM 
 ZIL
 SSDs and 128G of main memory

 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)

 If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize 
 of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)

My only pain so far is the time a send/receive takes without really loading the
network at all. VM performance is nothing I worry about, as it's pretty good.
So the key question for me is whether going from 8k to 16k or even 32k would
help with that problem?


 

 [ stuff removed ]

 For disaster recovery we plan to sync the pool as often as possible
 to a remote location. Running send/receive after a day or so seems to take
 a significant amount of time wading through all the blocks and we hardly
 see network average traffic going over 45MB/s (almost idle 1G link).
 So here's the question: would increasing/decreasing the volblocksize improve
 the send/receive operation and what influence might show for the iSCSI side?

 Matching the volume block size to what the clients are actually using (due 
 to their filesystem configuration) should improve
 performance during normal operations and should reduce the number of blocks 
 which need to be sent in the backup by reducing
 write amplification due to overlap blocks..
 
 compression is a good win, too 

Thanks for that. I'll use the tools you mentioned to drill down.

  -- richard

Thomas

 
 --
 
 richard.ell...@richardelling.com
 +1-760-896-4422


[zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-16 Thread Thomas Nau
Dear all
I've a question concerning possible performance tuning for both iSCSI access
and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
SSDs and 128G of main memory

The iSCSI access pattern (1 hour daytime average) looks like the following
(Thanks to Richard Elling for the dtrace script)



(counts per power-of-two I/O size bucket, reads and writes)

  I/O size (bytes)       R count       W count
               256             0             0
               512         22980         35961
              1024           663         25108
              2048          1075         10222
              4096        433819       1243634
              8192         40876        521519
             16384         37218        218932
             32768         82584        146519
             65536         34784           112
            131072         25968            15
            262144         14884            78
            524288            69             0
           1048576             0             0

For disaster recovery we plan to sync the pool as often as possible
to a remote location. Running send/receive after a day or so seems to take
a significant amount of time wading through all the blocks, and we hardly
see average network traffic go over 45MB/s (on an almost idle 1G link).
So here's the question: would increasing or decreasing the volblocksize improve
the send/receive operation, and what influence might that have on the iSCSI side?
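
For reference, the kind of daily incremental sync we have in mind looks
roughly like the sketch below (pool, snapshot and host names are placeholders):

  zfs snapshot -r tank@20130116
  zfs send -R -i tank@20130115 tank@20130116 | \
      ssh backuphost zfs receive -duF backuppool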

Thanks for any help
Thomas