Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Wed, 16 Jan 2013, Thomas Nau wrote:
> Dear all
> I've a question concerning possible performance tuning for both iSCSI access
> and replicating a ZVOL through zfs send/receive. We export ZVOLs with the
> default volblocksize of 8k to a bunch of Citrix Xen Servers through iSCSI.
> The pool is made of SAS2 disks (11 x 3-way mirrors) plus mirrored STEC RAM
> ZIL SSDs and 128G of main memory. The iSCSI access pattern (1 hour daytime
> average) looks like the following (thanks to Richard Elling for the dtrace
> script)

If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize of 4K? This seems like the most obvious improvement.

[ stuff removed ]

> For disaster recovery we plan to sync the pool as often as possible to a
> remote location. Running send/receive after a day or so seems to take a
> significant amount of time wading through all the blocks, and we hardly see
> average network traffic going over 45MB/s (almost idle 1G link). So here's
> the question: would increasing/decreasing the volblocksize improve the
> send/receive operation, and what influence might it have on the iSCSI side?

Matching the volume block size to what the clients are actually using (due to their filesystem configuration) should improve performance during normal operations, and should reduce the number of blocks which need to be sent in the backup by reducing write amplification due to overlapping blocks.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
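[For reference, a minimal sketch of what that change looks like on the storage side. The pool and volume names below are made up, and volblocksize is fixed at ZVOL creation time, so an existing volume would have to be recreated and its contents migrated:]

  # Hypothetical names; volblocksize cannot be changed on an existing ZVOL.
  zfs create -V 200G -o volblocksize=4k tank/xen-vol-4k

  # Confirm the property afterwards:
  zfs get volblocksize tank/xen-vol-4k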
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-17 16:04, Bob Friesenhahn wrote:
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a volblocksize
> of 4K? This seems like the most obvious improvement.
>
> Matching the volume block size to what the clients are actually using (due
> to their filesystem configuration) should improve performance during normal
> operations, and should reduce the number of blocks which need to be sent in
> the backup by reducing write amplification due to overlapping blocks.

Also, while you are at it, it would make sense to verify that the clients (i.e. the VMs' filesystems) do their IOs 4KB-aligned, i.e. that their partitions start at a 512b-sector offset divisible by 8 inside the virtual HDDs, and that the FS headers also align to that, so the first cluster is 4KB-aligned. The classic MSDOS MBR layout did not guarantee such a partition start, because it used 63 sectors as the cylinder size and offset factor. Newer OSes don't use the classic layout (any configuration is allowable), and GPT is well aligned as well.

Overall, a single IO in the VM guest changing a 4KB cluster in its FS should translate to one 4KB IO in your backend storage changing the dataset's userdata (without reading a bigger block and modifying it with COW), plus some avalanche of metadata updates (likely with COW) for ZFS's own bookkeeping.

//Jim
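[A quick way to check this from inside a guest. This sketch assumes a Linux guest with a Xen virtual disk named xvda; adjust the device name for your environment:]

  # Every partition's starting sector should be divisible by 8
  # (8 x 512-byte sectors = 4KB alignment).
  for p in /sys/block/xvda/xvda*/start; do
      start=$(cat "$p")
      if [ $((start % 8)) -eq 0 ]; then
          echo "$p: starts at sector $start (4KB-aligned)"
      else
          echo "$p: starts at sector $start (NOT 4KB-aligned)"
      fi
  done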
Re: [zfs-discuss] Heavy write IO for no apparent reason
On 01/16/2013 10:25 PM, Peter Wood wrote:
> Today I started migrating file systems from some old Open Solaris servers to
> these Supermicro boxes and noticed the transfer to one of them was going 10x
> slower than to the other one (like 10GB/hour).

What does dladm show-link show? I'm guessing one of your links is at 100mbps or at half duplex.
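[For what it's worth, on illumos / Solaris 11 the negotiated speed and duplex are reported by dladm show-phys rather than show-link, so checking both is probably worthwhile:]

  # Link state and MTU:
  dladm show-link
  # Negotiated speed and duplex per physical NIC:
  dladm show-phys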
Re: [zfs-discuss] Heavy write IO for no apparent reason
I have a script that rotates hourly, daily and monthly snapshots. Each filesystem has about 40 snapshots (zfsList.png - output of 'zfs list | grep -v home/'; the home directory datasets are snipped from the output, 4 users in total).

I noticed that the hourly snapshots on the filesystem in heaviest use are about 1.2GB in size, whereas on the other system the regular NFS-exported filesystem has snapshots of about 60MB (gallerySnapshots.png - output of 'zfs list -t snapshot -r pool01/utils/gallery').

I know that the gallery FS is in heavier use than normal, but I was told it would be mostly reading, and based on the iostat it seems that there is heavy writing too. I guess I'll schedule some downtime, disable the gallery export and see if that affects the number of write operations and performance in general. Unless there is some other way to test what/where these write operations are applied.

The 'zpool iostat -v' output is uncomfortably static. The values of read/write operations and bandwidth are the same for hours and even days. I'd expect at least some variation between morning and night. The load on the servers is different for sure.

Any input?

Thanks,

-- Peter

On Wed, Jan 16, 2013 at 7:49 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:
> On Wed, 16 Jan 2013, Peter Wood wrote:
>> Running zpool iostat -v (attachment zpool-IOStat.png) shows 1.22K write
>> operations on the drives and 661 on the ZIL. Compared to the other server
>> (which is in way heavier use than this one) these numbers are extremely
>> high. Any idea how to debug any further?
>
> Do some filesystems contain many snapshots? Do some filesystems use small
> zfs block sizes? Have the servers been used the same?
>
> Bob
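[A quick way to see where the snapshot space is actually going is to sort the snapshot listing by space used; the dataset name below is the one from the original post:]

  # Largest snapshots last; 'used' is the space unique to each snapshot.
  zfs list -t snapshot -r -o name,used,refer -s used pool01/utils/gallery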
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Peter Wood wrote:
> Unless there is some other way to test what/where these write operations
> are applied.

You can install Brendan Gregg's DTraceToolkit and use it to find out who and what is doing all the writing. 1.2GB in an hour is quite a lot of writing. If this is going on continuously, then it may be causing more fragmentation in conjunction with your snapshots.

See http://www.brendangregg.com/dtrace.html.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
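[Even without the full toolkit, a one-liner along these lines (illumos syscall provider) gives a first cut of which processes are writing the most; the DTraceToolkit scripts rwtop and rwsnoop give a nicer rolling view of the same thing:]

  # Bytes requested per process through the write(2) syscall,
  # aggregated until Ctrl-C:
  dtrace -n 'syscall::write:entry { @bytes[execname] = sum(arg2); }'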
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood peterwood...@gmail.com wrote:
> The 'zpool iostat -v' output is uncomfortably static. The values of
> read/write operations and bandwidth are the same for hours and even days.
> I'd expect at least some variation between morning and night. The load on
> the servers is different for sure. Any input?

Without a repetition time parameter, zpool iostat will print exactly once and exit, and the output is an average from kernel boot until now, just like iostat; this is why it seems so static. If you want to know the activity over 5-second intervals, use something like "zpool iostat -v 5" (repeat every 5 seconds) and wait for the second and later blocks. The second and later blocks are the average from the previous output until now. I generally use 5-second intervals to match the 5-second commit interval on my pools.

Tim
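[In other words, with a made-up pool name:]

  # One shot: averages everything since boot.
  zpool iostat -v pool01
  # Report every 5 seconds; ignore the first block, the later ones cover
  # only the preceding 5-second interval.
  zpool iostat -v pool01 5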
Re: [zfs-discuss] Heavy write IO for no apparent reason
On 2013-01-18 00:42, Bob Friesenhahn wrote:
> You can install Brendan Gregg's DTraceToolkit and use it to find out who
> and what is doing all the writing. 1.2GB in an hour is quite a lot of
> writing. If this is going on continuously, then it may be causing more
> fragmentation in conjunction with your snapshots.

As a moderately wild guess, since you're speaking of galleries: are these problematic filesystems often read? By default ZFS updates the last access time of files it reads, as do many other filesystems, and this causes avalanches of metadata updates - sync writes (likely) as well as fragmentation. This may also be a poorly traceable but considerable consumer of space in frequent snapshots. You can verify (and unset) this behaviour with the ZFS dataset property "atime", i.e.:

  # zfs get atime pond/export/home
  NAME              PROPERTY  VALUE  SOURCE
  pond/export/home  atime     off    inherited from pond

On another note, verify where your software keeps its temporary files (i.e. during uploads, as may be the case with galleries). Again, if this is a frequently snapshotted dataset (though 1 hour is not really that frequent), then needless temp files can be held by those older snapshots. Moving such temporary work to a different dataset with a different snapshot schedule and/or to a different pool (to keep the related fragmentation constrained) may prove useful.

HTH,
//Jim Klimov
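[If the access-time updates do turn out to be the culprit, turning them off on the affected dataset is a one-liner; the dataset name is taken from the earlier post and child datasets inherit the setting unless they override it:]

  zfs set atime=off pool01/utils/gallery
  # Verify, including children:
  zfs get -r atime pool01/utils/gallery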
Re: [zfs-discuss] Heavy write IO for no apparent reason
Right on, Tim. Thanks. I didn't know that. I'm sure it's documented somewhere and I should have read it, so double thanks for explaining it.

On Thu, Jan 17, 2013 at 4:18 PM, Timothy Coalson tsc...@mst.edu wrote:
> On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood peterwood...@gmail.com wrote:
>> The 'zpool iostat -v' output is uncomfortably static. The values of
>> read/write operations and bandwidth are the same for hours and even days.
>> I'd expect at least some variation between morning and night. The load on
>> the servers is different for sure. Any input?
>
> Without a repetition time parameter, zpool iostat will print exactly once
> and exit, and the output is an average from kernel boot until now, just
> like iostat; this is why it seems so static. If you want to know the
> activity over 5-second intervals, use something like "zpool iostat -v 5"
> (repeat every 5 seconds) and wait for the second and later blocks. The
> second and later blocks are the average from the previous output until now.
> I generally use 5-second intervals to match the 5-second commit interval
> on my pools.
>
> Tim
Re: [zfs-discuss] Heavy write IO for no apparent reason
Great points, Jim. I have requested more information on how the gallery share is being used, and any temporary data will be moved out of there.

About atime: it is set to "on" right now and I've considered turning it off, but I wasn't sure if this would affect incremental zfs send/receive. 'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?

On Thu, Jan 17, 2013 at 4:34 PM, Jim Klimov jimkli...@cos.ru wrote:
> On 2013-01-18 00:42, Bob Friesenhahn wrote:
>> You can install Brendan Gregg's DTraceToolkit and use it to find out who
>> and what is doing all the writing. 1.2GB in an hour is quite a lot of
>> writing. If this is going on continuously, then it may be causing more
>> fragmentation in conjunction with your snapshots.
>
> As a moderately wild guess, since you're speaking of galleries: are these
> problematic filesystems often read? By default ZFS updates the last access
> time of files it reads, as do many other filesystems, and this causes
> avalanches of metadata updates - sync writes (likely) as well as
> fragmentation. This may also be a poorly traceable but considerable
> consumer of space in frequent snapshots. You can verify (and unset) this
> behaviour with the ZFS dataset property "atime", i.e.:
>
>   # zfs get atime pond/export/home
>   NAME              PROPERTY  VALUE  SOURCE
>   pond/export/home  atime     off    inherited from pond
>
> On another note, verify where your software keeps its temporary files
> (i.e. during uploads, as may be the case with galleries). Again, if this
> is a frequently snapshotted dataset (though 1 hour is not really that
> frequent), then needless temp files can be held by those older snapshots.
> Moving such temporary work to a different dataset with a different snapshot
> schedule and/or to a different pool (to keep the related fragmentation
> constrained) may prove useful.
>
> HTH,
> //Jim Klimov
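[For context, the shape of such an incremental replication, with hypothetical snapshot and host names. The stream contains only the blocks that differ between the two snapshots; whether atime is on or off merely changes how many metadata blocks happen to differ:]

  zfs snapshot pool01/utils/gallery@snapshot1
  zfs send -i pool01/utils/gallery@snapshot0 pool01/utils/gallery@snapshot1 \
    | ssh backuphost zfs receive -F backup/gallery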
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:
> On Wed, 16 Jan 2013, Thomas Nau wrote:
>> Dear all
>> I've a question concerning possible performance tuning for both iSCSI
>> access and replicating a ZVOL through zfs send/receive. We export ZVOLs
>> with the default volblocksize of 8k to a bunch of Citrix Xen Servers
>> through iSCSI. The pool is made of SAS2 disks (11 x 3-way mirrors) plus
>> mirrored STEC RAM ZIL SSDs and 128G of main memory. The iSCSI access
>> pattern (1 hour daytime average) looks like the following (thanks to
>> Richard Elling for the dtrace script)
>
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
> volblocksize of 4K? This seems like the most obvious improvement.

4k might be a little small. 8k will have less metadata overhead. In some cases we've seen good performance on these workloads up through 32k. Real pain is felt at 128k :-)

[ stuff removed ]

>> For disaster recovery we plan to sync the pool as often as possible to a
>> remote location. Running send/receive after a day or so seems to take a
>> significant amount of time wading through all the blocks, and we hardly
>> see average network traffic going over 45MB/s (almost idle 1G link). So
>> here's the question: would increasing/decreasing the volblocksize improve
>> the send/receive operation, and what influence might it have on the iSCSI
>> side?
>
> Matching the volume block size to what the clients are actually using (due
> to their filesystem configuration) should improve performance during
> normal operations, and should reduce the number of blocks which need to be
> sent in the backup by reducing write amplification due to overlapping
> blocks.

compression is a good win, too
 -- richard

--
richard.ell...@richardelling.com
+1-760-896-4422
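[A sketch of enabling compression on an existing ZVOL; the volume name is made up, compression only applies to blocks written after the property is set, and lz4 is only available on builds that already include that feature:]

  zfs set compression=lzjb tank/xen-vol    # or compression=lz4 where supported
  # Check how well it is doing once some new writes have landed:
  zfs get compression,compressratio tank/xen-vol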
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 17, 2013, at 8:35 AM, Jim Klimov jimkli...@cos.ru wrote:
> On 2013-01-17 16:04, Bob Friesenhahn wrote:
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
>> volblocksize of 4K? This seems like the most obvious improvement.
>>
>> Matching the volume block size to what the clients are actually using
>> (due to their filesystem configuration) should improve performance during
>> normal operations, and should reduce the number of blocks which need to
>> be sent in the backup by reducing write amplification due to overlapping
>> blocks.
>
> Also, while you are at it, it would make sense to verify that the clients
> (i.e. the VMs' filesystems) do their IOs 4KB-aligned, i.e. that their
> partitions start at a 512b-sector offset divisible by 8 inside the virtual
> HDDs, and that the FS headers also align to that, so the first cluster is
> 4KB-aligned.

This is the classical expectation. So I added an alignment check into nfssvrtop and iscsisvrtop. I've looked at a *ton* of NFS workloads from ESX and, believe it or not, alignment doesn't matter at all, at least for the data I've collected. I'll let NetApp wallow in the mire of misalignment while I blissfully dream of other things :-)

> The classic MSDOS MBR layout did not guarantee such a partition start,
> because it used 63 sectors as the cylinder size and offset factor. Newer
> OSes don't use the classic layout (any configuration is allowable), and
> GPT is well aligned as well.
>
> Overall, a single IO in the VM guest changing a 4KB cluster in its FS
> should translate to one 4KB IO in your backend storage changing the
> dataset's userdata (without reading a bigger block and modifying it with
> COW), plus some avalanche of metadata updates (likely with COW) for ZFS's
> own bookkeeping.

I've never seen a 1:1 correlation from the VM guest to the workload on the wire. To wit, I did a bunch of VDI and VDI-like (small, random writes) testing on XenServer and, while the clients were chugging away doing 4K random I/Os, on the wire I was seeing 1MB NFS writes. In part this analysis led to my cars-and-trains analysis. In some VMware configurations, over the wire you could see a 16k read for every 4k random write. Go figure. Fortunately, those 16k reads find their way into the MFU side of the ARC :-)

Bottom line: use tools like iscsisvrtop and dtrace to get an idea of what is really happening over the wire.
 -- richard

--
richard.ell...@richardelling.com
+1-760-896-4422
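[For a rough, tool-agnostic view of what the backend disks actually see, a one-liner with the DTrace io provider works on illumos/Solaris:]

  # Distribution of physical I/O sizes reaching the block layer:
  dtrace -n 'io:::start { @["I/O size (bytes)"] = quantize(args[0]->b_bcount); }'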
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Peter Wood wrote:
> Great points, Jim. I have requested more information on how the gallery
> share is being used, and any temporary data will be moved out of there.
> About atime: it is set to "on" right now and I've considered turning it
> off, but I wasn't sure if this would affect incremental zfs send/receive.
> 'zfs send -i snapshot0 snapshot1' doesn't rely on the atime, right?

Zfs send does not care about atime. The access time is useless other than as a way to see how long it has been since a file was accessed.

For local access (not true for NFS), zfs is lazy about updating atime on disk, so it may not be updated on disk until the next transaction group is written (e.g. up to 5 seconds) and so it does not represent much actual load. Without this behavior, the system could become unusable.

For NFS you should disable atime on the NFS client mounts.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Heavy write IO for no apparent reason
On Thu, 17 Jan 2013, Bob Friesenhahn wrote:
> For NFS you should disable atime on the NFS client mounts.

This advice was wrong. It needs to be done on the server side.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
Thanks for all the answers (more inline)

On 01/18/2013 02:42 AM, Richard Elling wrote:
> On Jan 17, 2013, at 7:04 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:
>> On Wed, 16 Jan 2013, Thomas Nau wrote:
>>> Dear all
>>> I've a question concerning possible performance tuning for both iSCSI
>>> access and replicating a ZVOL through zfs send/receive. We export ZVOLs
>>> with the default volblocksize of 8k to a bunch of Citrix Xen Servers
>>> through iSCSI. The pool is made of SAS2 disks (11 x 3-way mirrors) plus
>>> mirrored STEC RAM ZIL SSDs and 128G of main memory. The iSCSI access
>>> pattern (1 hour daytime average) looks like the following (thanks to
>>> Richard Elling for the dtrace script)
>>
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
>> volblocksize of 4K? This seems like the most obvious improvement.
>
> 4k might be a little small. 8k will have less metadata overhead. In some
> cases we've seen good performance on these workloads up through 32k. Real
> pain is felt at 128k :-)

My only pain so far is the time a send/receive takes without really loading the network at all. VM performance is nothing I worry about at all as it's pretty good. So the key question for me is whether going from 8k to 16k or even 32k would have some benefit for that problem?

> [ stuff removed ]
>
>>> For disaster recovery we plan to sync the pool as often as possible to a
>>> remote location. Running send/receive after a day or so seems to take a
>>> significant amount of time wading through all the blocks, and we hardly
>>> see average network traffic going over 45MB/s (almost idle 1G link). So
>>> here's the question: would increasing/decreasing the volblocksize improve
>>> the send/receive operation, and what influence might it have on the iSCSI
>>> side?
>>
>> Matching the volume block size to what the clients are actually using
>> (due to their filesystem configuration) should improve performance during
>> normal operations, and should reduce the number of blocks which need to
>> be sent in the backup by reducing write amplification due to overlapping
>> blocks.
>
> compression is a good win, too
>  -- richard

Thanks for that. I'll use your mentioned tools to drill down.

Thomas
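[One way to narrow down where the time goes, sketched with made-up dataset and host names: time the incremental send to /dev/null first to separate walking the changed blocks from the transport; if the local send is fast, a buffered pipe (e.g. the third-party mbuffer tool) often keeps a 1G link much busier than a plain ssh pipe:]

  # Is the bottleneck generating the stream, or moving it?
  time zfs send -i tank/xen-vol@yesterday tank/xen-vol@today > /dev/null

  # If the local send is fast, buffer both ends of the pipe:
  zfs send -i tank/xen-vol@yesterday tank/xen-vol@today \
    | mbuffer -s 128k -m 1G \
    | ssh backuphost 'mbuffer -s 128k -m 1G | zfs receive -F backup/xen-vol'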