Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Bob Friesenhahn

On Wed, 16 Jan 2013, Thomas Nau wrote:


Dear all
I've a question concerning possible performance tuning for both iSCSI access
and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM ZIL
SSDs and 128G of main memory

The iSCSI access pattern (1 hour daytime average) looks like the following
(Thanks to Richard Elling for the dtrace script)


If almost all of the I/Os are 4K, maybe your ZVOLs should use a 
volblocksize of 4K?  This seems like the most obvious improvement.
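
For reference, a minimal sketch of such a change (tank and the volume
name are placeholders; volblocksize can only be set when the zvol is
created, so an existing volume has to be recreated and its data copied
over):

# zfs create -V 100G -o volblocksize=4k tank/newvol
# zfs get volblocksize tank/newvol
NAME         PROPERTY      VALUE  SOURCE
tank/newvol  volblocksize  4K     -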


[ stuff removed ]


For disaster recovery we plan to sync the pool as often as possible
to a remote location. Running send/receive after a day or so seems to take
a significant amount of time wading through all the blocks, and we hardly
see average network traffic going over 45MB/s (on an almost idle 1G link).
So here's the question: would increasing/decreasing the volblocksize improve
the send/receive operation, and what influence might it have on the iSCSI side?


Matching the volume block size to what the clients are actually using 
(due to their filesystem configuration) should improve performance 
during normal operations and should reduce the number of blocks which 
need to be sent in the backup by reducing write amplification due to 
overlapping blocks.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Jim Klimov

On 2013-01-17 16:04, Bob Friesenhahn wrote:

If almost all of the I/Os are 4K, maybe your ZVOLs should use a
volblocksize of 4K?  This seems like the most obvious improvement.



Matching the volume block size to what the clients are actually using
(due to their filesystem configuration) should improve performance
during normal operations and should reduce the number of blocks which
need to be sent in the backup by reducing write amplification due to
overlapping blocks.



Also, while you are at it, it would make sense to verify that the
clients (i.e. the VMs' filesystems) do their I/Os 4KB-aligned, i.e. that
their partitions start at a 512b-based sector offset divisible by
8 inside the virtual HDDs, and that the FS headers also align to that
so the first cluster is 4KB-aligned.

The classic MSDOS MBR layout did not guarantee such a partition start,
since it used 63 sectors as the cylinder size and offset factor. Newer
OSes don't use the classic layout, as any configuration is allowed; and
GPT partitions are well aligned too.
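
A quick check from inside a Linux guest might look like this (a sketch;
/dev/xvda is a placeholder for the Xen virtual disk, and on Solaris-style
guests prtvtoc reports the same starting sectors):

# fdisk -l -u /dev/xvda
(look at the Start column: a start sector divisible by 8, such as 2048,
means the partition is 4KB-aligned on 512b sectors)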

Overall, a single IO in the VM guest changing a 4KB cluster in its
FS should translate to one 4KB IO in your backend storage changing
the dataset's userdata (without reading a bigger block and modifying
it with COW), plus some avalanche of metadata updates (likely with
the COW) for ZFS's own bookkeeping.

//Jim



Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Ray Arachelian
On 01/16/2013 10:25 PM, Peter Wood wrote:

 Today I started migrating file systems from some old Open Solaris
 servers to these Supermicro boxes and noticed the transfer to one of
 them was going 10x slower than to the other one (like 10GB/hour).

What does 'dladm show-link' show?  I'm guessing one of your links is at
100Mbps or at half duplex.
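
For the negotiated speed and duplex specifically, dladm show-phys is the
one to check (illustrative output; link names will differ):

# dladm show-phys
LINK   MEDIA      STATE  SPEED  DUPLEX  DEVICE
igb0   Ethernet   up     1000   full    igb0
igb1   Ethernet   up     100    half    igb1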


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Peter Wood
I have a script that rotates hourly, daily and monthly snapshots. Each
filesystem has about 40 snapshots (zfsList.png - output of 'zfs list | grep
-v home/'; the home directory datasets are snipped from the output, 4
users in total).

I noticed that the hourly snapshots on the filesystem in heaviest use are
about 1.2GB in size, whereas on the other system the regular NFS-exported
filesystem has about 60MB snapshots (gallerySnapshots.png - output of
'zfs list -t snapshot -r pool01/utils/gallery').

I know that the gallery FS is in heavier use than normal, but I was told it
would be mostly reading, and based on iostat it seems that there is heavy
writing too.

I guess I'll schedule some downtime, disable the gallery export, and see if
that affects the number of write operations and performance in general.

Unless there is some other way to test what/where these write operations
are applied.

The 'zpool iostat -v' output is uncomfortably static. The values of
read/write operations and bandwidth are the same for hours and even days.
I'd expect at least some variations between morning and night. The load on
the servers is different for sure. Any input?

Thanks,

-- Peter


On Wed, Jan 16, 2013 at 7:49 PM, Bob Friesenhahn 
bfrie...@simple.dallas.tx.us wrote:

 On Wed, 16 Jan 2013, Peter Wood wrote:


 Running zpool iostat -v (attachment zpool-IOStat.png) shows 1.22K write
 operations on the drives and 661 on the
 ZIL. Compared to the other server (which is in way heavier use than this
 one) these numbers are extremely high.

 Any idea how to debug any further?


 Do some filesystems contain many snapshots?  Do some filesystems use small
 zfs block sizes?  Have the servers been used the same?

 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Peter Wood wrote:


Unless there is some other way to test what/where these write operations are 
applied.


You can install Brendan Gregg's DTraceToolkit and use it to find out 
who and what is doing all the writing.  1.2GB in an hour is quite a 
lot of writing.  If this is going continuously, then it may be causing 
more fragmentation in conjunction with your snapshots.


See http://www.brendangregg.com/dtrace.html.
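
As a quick first pass, a DTrace one-liner along these lines (a sketch;
run as root) shows which processes are requesting writes and how many
bytes they ask for:

# dtrace -n 'syscall::write:entry { @bytes[execname] = sum(arg2); }'

Let it run for a minute, press Ctrl-C, and the busiest writers are
printed last.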

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Timothy Coalson
On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood peterwood...@gmail.com wrote:


 The 'zpool iostat -v' output is uncomfortably static. The values of
 read/write operations and bandwidth are the same for hours and even days.
 I'd expect at least some variations between morning and night. The load on
 the servers is different for sure. Any input?


Without a repetition time parameter, zpool iostat will print exactly once
and exit, and the output is an average from kernel boot until now, just like
iostat; this is why it seems so static.  If you want to know the activity
over 5 second intervals, use something like 'zpool iostat -v 5' (repeat
every 5 seconds) and wait for the second and later blocks.  The second and
later blocks are averages from the previous output until now.  I generally
use 5 second intervals to match the 5 second commit interval on my pools.
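
For example (a sketch; pool01 stands in for your pool name):

# zpool iostat -v pool01 5

The first block is the boot-to-now average; each block after that covers
only the preceding 5 seconds.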

Tim


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Jim Klimov

On 2013-01-18 00:42, Bob Friesenhahn wrote:

You can install Brendan Gregg's DTraceToolkit and use it to find out who
and what is doing all the writing.  1.2GB in an hour is quite a lot of
writing.  If this is going continuously, then it may be causing more
fragmentation in conjunction with your snapshots.


As a moderately wild guess, since you're speaking of galleries:
are these problematic filesystems often read? By default ZFS
updates the last access time of files it reads, as do many other
filesystems, and this causes avalanches of metadata updates -
sync writes (likely) as well as fragmentation. This may also
be a poorly traceable but considerable consumer of space in frequent
snapshots. You can verify (and unset) this behaviour with the
ZFS dataset property atime, i.e.:

# zfs get atime pond/export/home
NAME              PROPERTY  VALUE  SOURCE
pond/export/home  atime     off    inherited from pond
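
Unsetting it is a one-liner at the top of the tree (a sketch; pond is the
example pool above, and child datasets inherit the setting unless they
override it):

# zfs set atime=off pond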

On the other hand, verify where your software keeps its temporary
files (e.g. during uploads, as may be the case with galleries). Again,
if this is a frequently snapshotted dataset (though 1 hour is not
really that frequent) then needless temp files can be held by
those older snapshots. Moving such temporary work to a different
dataset with a different snapshot schedule and/or to a different
pool (to keep the related fragmentation constrained) may prove useful.

HTH,
//Jim Klimov



Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Peter Wood
Right on Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it, so double thanks for explaining it.


On Thu, Jan 17, 2013 at 4:18 PM, Timothy Coalson tsc...@mst.edu wrote:

 On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood peterwood...@gmail.com wrote:


 The 'zpool iostat -v' output is uncomfortably static. The values of
 read/write operations and bandwidth are the same for hours and even days.
 I'd expect at least some variations between morning and night. The load on
 the servers is different for sure. Any input?


 Without a repetition time parameter, zpool iostat will print exactly once
 and exit, and the output is an average from kernel boot until now, just like
 iostat; this is why it seems so static.  If you want to know the activity
 over 5 second intervals, use something like 'zpool iostat -v 5' (repeat
 every 5 seconds) and wait for the second and later blocks.  The second and
 later blocks are averages from the previous output until now.  I generally
 use 5 second intervals to match the 5 second commit interval on my pools.

 Tim




Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Peter Wood
Great points Jim. I have requested more information about how the gallery
share is being used, and any temporary data will be moved out of there.

About atime: it is set to on right now and I've considered turning it off,
but I wasn't sure if this would affect incremental zfs send/receive.

'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?


On Thu, Jan 17, 2013 at 4:34 PM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-18 00:42, Bob Friesenhahn wrote:

 You can install Brendan Gregg's DTraceToolkit and use it to find out who
 and what is doing all the writing.  1.2GB in an hour is quite a lot of
 writing.  If this is going continuously, then it may be causing more
 fragmentation in conjunction with your snapshots.


 As a moderately wild guess, since you're speaking of galleries:
 are these problematic filesystems often read? By default ZFS
 updates the last access time of files it reads, as do many other
 filesystems, and this causes avalanches of metadata updates -
 sync writes (likely) as well as fragmentation. This may also
 be a poorly traceable but considerable consumer of space in frequent
 snapshots. You can verify (and unset) this behaviour with the
 ZFS dataset property atime, i.e.:

 # zfs get atime pond/export/home
 NAME              PROPERTY  VALUE  SOURCE
 pond/export/home  atime     off    inherited from pond

 On the other hand, verify where your software keeps its temporary
 files (e.g. during uploads, as may be the case with galleries). Again,
 if this is a frequently snapshotted dataset (though 1 hour is not
 really that frequent) then needless temp files can be held by
 those older snapshots. Moving such temporary work to a different
 dataset with a different snapshot schedule and/or to a different
 pool (to keep the related fragmentation constrained) may prove useful.

 HTH,
 //Jim Klimov





Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Wed, 16 Jan 2013, Thomas Nau wrote:
 
 Dear all
 I've a question concerning possible performance tuning for both iSCSI access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM 
 ZIL
 SSDs and 128G of main memory
 
 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)
 
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize 
 of 4K?  This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases
we've seen good performance on these workloads up through 32k. Real pain
is felt at 128k :-)

 
 [ stuff removed ]
 
 For disaster recovery we plan to sync the pool as often as possible
 to a remote location. Running send/receive after a day or so seems to take
 a significant amount of time wading through all the blocks, and we hardly
 see average network traffic going over 45MB/s (on an almost idle 1G link).
 So here's the question: would increasing/decreasing the volblocksize improve
 the send/receive operation, and what influence might it have on the iSCSI side?
 
 Matching the volume block size to what the clients are actually using (due to 
 their filesystem configuration) should improve performance during normal 
 operations and should reduce the number of blocks which need to be sent in 
 the backup by reducing write amplification due to overlapping blocks.

compression is a good win, too.
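
For instance (a sketch; lz4 requires a pool with the new lz4_compress
feature, otherwise lzjb is the safe default):

# zfs set compression=lz4 tank/vol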
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Richard Elling

On Jan 17, 2013, at 8:35 AM, Jim Klimov jimkli...@cos.ru wrote:

 On 2013-01-17 16:04, Bob Friesenhahn wrote:
 If almost all of the I/Os are 4K, maybe your ZVOLs should use a
 volblocksize of 4K?  This seems like the most obvious improvement.
 
 Matching the volume block size to what the clients are actually using
 (due to their filesystem configuration) should improve performance
 during normal operations and should reduce the number of blocks which
 need to be sent in the backup by reducing write amplification due to
 overlapping blocks.
 
 
 Also, while you are at it, it would make sense to verify that the
 clients (i.e. the VMs' filesystems) do their I/Os 4KB-aligned, i.e. that
 their partitions start at a 512b-based sector offset divisible by
 8 inside the virtual HDDs, and that the FS headers also align to that
 so the first cluster is 4KB-aligned.

This is the classical expectation. So I added an alignment check into
nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from
ESX and, believe it or not, alignment doesn't matter at all, at least for 
the data I've collected. I'll let NetApp wallow in the mire of misalignment
while I blissfully dream of other things :-)

 The classic MSDOS MBR layout did not guarantee such a partition start,
 since it used 63 sectors as the cylinder size and offset factor. Newer
 OSes don't use the classic layout, as any configuration is allowed; and
 GPT partitions are well aligned too.
 
 Overall, a single IO in the VM guest changing a 4KB cluster in its
 FS should translate to one 4KB IO in your backend storage changing
 the dataset's userdata (without reading a bigger block and modifying
 it with COW), plus some avalanche of metadata updates (likely with
 the COW) for ZFS's own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload
on the wire. To wit, I did a bunch of VDI and VDI-like (small, random
writes) testing on XenServer and while the clients were chugging
away doing 4K random I/Os, on the wire I was seeing 1MB NFS
writes. In part this analysis led to my cars-and-trains analysis.

In some VMware configurations, over the wire you could see a 16k
read for every 4k random write. Go figure. Fortunately, those 16k 
reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of
what is really happening over the wire.
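
For the I/O size distribution on the storage side, an io-provider
one-liner like this is a good start (a sketch; run as root on the server):

# dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = quantize(args[0]->b_bcount); }'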
 -- richard

--

richard.ell...@richardelling.com
+1-760-896-4422


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Peter Wood wrote:


Great points Jim. I have requested more information about how the gallery
share is being used, and any temporary data will be moved out of there.
About atime: it is set to on right now and I've considered turning it off,
but I wasn't sure if this would affect incremental zfs send/receive.

'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?


zfs send does not care about atime.  The access time is useless other 
than as a way to see how long it has been since a file was accessed.


For local access (not true for NFS), ZFS is lazy about updating atime 
on disk, so it may not be updated on disk until the next transaction 
group is written (e.g. up to 5 seconds), and so it does not represent 
much actual load.  Without this behavior, the system could become 
unusable.


For NFS you should disable atime on the NFS client mounts.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] Heavy write IO for no apparent reason

2013-01-17 Thread Bob Friesenhahn

On Thu, 17 Jan 2013, Bob Friesenhahn wrote:


For NFS you should disable atime on the NFS client mounts.


This advice was wrong.  It needs to be done on the server side.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] iSCSI access patterns and possible improvements?

2013-01-17 Thread Thomas Nau
Thanks for all the answers (more inline).

On 01/18/2013 02:42 AM, Richard Elling wrote:
 On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:
 
 On Wed, 16 Jan 2013, Thomas Nau wrote:

 Dear all
 I've a question concerning possible performance tuning for both iSCSI access
 and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
 default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
 The pool is made of SAS2 disks (11 x 3-way mirrored) plus mirrored STEC RAM 
 ZIL
 SSDs and 128G of main memory

 The iSCSI access pattern (1 hour daytime average) looks like the following
 (Thanks to Richard Elling for the dtrace script)

 If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize 
 of 4K?  This seems like the most obvious improvement.
 
 4k might be a little small. 8k will have less metadata overhead. In some cases
 we've seen good performance on these workloads up through 32k. Real pain
 is felt at 128k :-)

My only pain so far is the time a send/receive takes without really loading
the network. VM performance is nothing I worry about, as it's pretty good.
So the key question for me is whether going from 8k to 16k or even 32k would
help with that problem.
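
The replication itself is the usual incremental pipeline (a sketch; the
pool, zvol, snapshot and host names below are placeholders):

# zfs snapshot tank/vol@today
# zfs send -i tank/vol@yesterday tank/vol@today | \
      ssh backuphost zfs receive -F backup/vol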


 

 [ stuff removed ]

 For disaster recovery we plan to sync the pool as often as possible
 to a remote location. Running send/receive after a day or so seems to take
 a significant amount of time wading through all the blocks, and we hardly
 see average network traffic going over 45MB/s (on an almost idle 1G link).
 So here's the question: would increasing/decreasing the volblocksize improve
 the send/receive operation, and what influence might it have on the iSCSI side?

 Matching the volume block size to what the clients are actually using (due 
 to their filesystem configuration) should improve
 performance during normal operations and should reduce the number of blocks 
 which need to be sent in the backup by reducing
 write amplification due to overlapping blocks.
 
 compression is a good win, too 

Thanks for that. I'll use the tools you mentioned to drill down.

  -- richard

Thomas

 