Jim,
They are running Solaris 10 11/06 (u3) with kernel patch 142900-12. See
inline for the rest...
On 10/25/10 11:19 AM, Jim Mauro wrote:
Hi Jim - cross-posting to zfs-discuss, because 20X is, to say the
least, compelling.
Obviously, it would be awesome if we had the opportunity to whittle down
which of the changes made this fly, or whether it was a combination of them.
Looking at them individually....
Not sure we can convince this customer to comply on the system in
question. HOWEVER, they also have another set of LDAP servers that are
experiencing the same types of problems with backups. I will see if
they would be willing to make the changes one at a time.
I think the best bet would be to reproduce in a lab, somewhere.
set zfs:zfs_vdev_cache_size = 0
The default for this is 10MB per vdev, and as I understand it (which
may be wrong), it is part of the device-level prefetch on reads.
set zfs:zfs_vdev_cache_bshift = 13
This obscurely named parameter defines the amount of data read from
disk for each device-level read (I think). It is a bit shift, so the
default value of 16 equates to 2^16 = 64k reads, and a value of 13
reduces disk read sizes to 2^13 = 8k.
set zfs:zfs_prefetch_disable = 1
The vdev parameters above relate to device-level prefetching.
zfs_prefetch_disable applies to file level prefetching.
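(If it helps, the values actually in effect on the running system can be
checked with mdb -k, assuming root access; these are the same kernel
variables the /etc/system lines above set:

  # print the live values of the prefetch/vdev-cache tunables (decimal)
  echo "zfs_prefetch_disable/D"  | mdb -k
  echo "zfs_vdev_cache_size/D"   | mdb -k
  echo "zfs_vdev_cache_bshift/D" | mdb -k
)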
With regard to the COW/scattered blocks query, it is certainly a
possible side-effect of COW that maintaining a sequential file block
layout can get challenging, but the TXG model and the coalescing of
writes help with that.
I was describing to the customer how ZFS uses COW for modifications, and
how it is possible, over time, to get fragmentation. From an LDAP
standpoint, they pointed out there are lots of cases where modifications
are made to already-existing larger files. In some respects, LDAP is
much like an Oracle database, just on a smaller scale.
Is there any way to monitor for fragmentation? Any dtrace scripts, perhaps?
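(Something as crude as an io-provider aggregation might at least show
whether the backup's physical reads are small and scattered across the
LUNs; this is just a sketch, not a ZFS-specific fragmentation tool:

  # bucket physical read sizes per device while the backup runs
  dtrace -n 'io:::start /args[0]->b_flags & B_READ/
      { @[args[1]->dev_statname] = quantize(args[0]->b_bcount); }'

Lots of small reads at low per-disk throughput on a nominally sequential
backup would be consistent with scattered blocks.)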
With regard to the changes (including the ARC size increase), it's
really impossible to say without data the extent to which prefetching
at one or both layers made the difference here. Was it the cumulative
effect of both, or was one a much larger contributing factor?
Understood. On the next system they try this on, they will leave ARC at
4GB to see if they still have large gains.
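(For what it's worth, we can also watch the ARC's actual size and target
during the backups with kstats, e.g.:

  # current ARC size, target, and ceiling, in bytes
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max

That should tell us whether the 4GB cap is actually being hit.)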
It would be interesting to reproduce this in a lab.
I was thinking the same thing. We would have to come up with some sort
of workload where portions of larger files get modified many times, and
try some of the tunables on sequential reads.
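(Purely a sketch of the kind of workload I have in mind - the pool/path,
sizes, and iteration count are made up, and it assumes compression is off
so the overwrites really allocate new blocks:

  #!/usr/bin/ksh
  # build a large file, rewrite random 32K chunks many times to force
  # COW re-allocation, then time a sequential read of the whole file
  mkfile 5g /tank/test/bigfile
  i=0
  while [ $i -lt 10000 ]; do
      # random 32K-aligned offset within the 5GB file (163840 32K blocks)
      off=$(( ($RANDOM * 32768 + $RANDOM) % 163840 ))
      dd if=/dev/zero of=/tank/test/bigfile bs=32k count=1 \
          seek=$off conv=notrunc > /dev/null 2>&1
      i=$(( i + 1 ))
  done
  ptime dd if=/tank/test/bigfile of=/dev/null bs=16k

Comparing the ptime result before and after the rewrite loop, with and
without the prefetch/vdev-cache tunables, might show how much of the
slowdown is layout versus tuning.)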
Jim
What release of Solaris 10 is this?
Thanks
/jim
... and increased their ARC to 8GB and backups that took 15+ hours
now take 45 minutes. They are still analyzing what effects
re-enabling prefetch has on their applications.
One other thing they noticed, before removing these tunables, is that
the backups were taking progressively longer each day. For instance,
at the beginning of last week they took 12 hours; by Friday, they were
taking 17 hours. This is with similarly sized datasets. They will be
keeping an eye on this, too, but I'm interested in any possible
causes that might be related to ZFS. One thing I've been told is
that ZFS COW (copy-on-write) operations can cause blocks to be
scattered across a disk, where they were once located closer to one
another.
We'll see how it behaves in the next week or so.
Thanks for the feedback,
Jim
On 10/21/10 02:49 PM, Amer Ather wrote:
Jim,
For sequential IO read performance, you need file system read-ahead.
By setting:
set zfs:zfs_prefetch_disable = 1
you have disabled ZFS prefetch, which is needed to boost sequential IO
performance. Normally, we recommend disabling it for Oracle OLTP-type
workloads to avoid IO inflation due to read-ahead. However, for backups
it needs to be enabled. Take this setting out of the /etc/system file
and retest.
Amer.
On 10/21/10 12:00 PM, Jim Nissen wrote:
I'm working with a customer who is having Directory Server backup
performance problems since switching to ZFS. In short, backups
that used to take 1 - 4 hours on UFS are now taking 12+ hours on
ZFS. We've figured out that ZFS reads seem to be throttled, whereas
writes seem really fast. Backend storage is IBM SVC.
As part of their cutover, they were given the following Best
Practice recommendations from LDAP folks @Sun...
/etc/system tunables:
set zfs:zfs_arc_max = 0x100000000
set zfs:zfs_vdev_cache_size = 0
set zfs:zfs_vdev_cache_bshift = 13
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_nocacheflush = 1
At ZFS filesystem level:
recordsize = 32K
noatime
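(For reference, the filesystem-level settings correspond to zfs
properties; the dataset name below is just a placeholder:

  zfs set recordsize=32k ldappool/data
  zfs set atime=off ldappool/data
)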
One of the things they noticed is that simple dd reads from one of
the 128K-recordsize filesystems run much faster (4 - 7 times) than
from their 32K filesystems. I joined a shared shell where we switched
the same filesystem from 32K to 128K, and we could see the underlying
disks getting 4x better throughput (from 1.5 - 2MB/sec to 8 -
10MB/s), whereas a direct dd against one of the disks shows that
the disks are capable of much more (45+ MB/sec).
Here are some snippets from iostat...
ZFS recordsize of 32K, dd if=./somelarge5gfile of=/dev/null bs=16k (to mimic application blocksizes)
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   67.6    0.0  2132.7    0.0  0.0  0.3    0.0    4.5   0  30 c6t60050768018E82BDA800000000000565d0
   67.4    0.0  2156.8    0.0  0.0  0.1    0.0    1.5   0  10 c6t60050768018E82BDA800000000000564d0
   68.4    0.0  2158.3    0.0  0.0  0.3    0.0    4.5   0  31 c6t60050768018E82BDA800000000000563d0
   66.2    0.0  2118.4    0.0  0.0  0.2    0.0    3.4   0  22 c6t60050768018E82BDA800000000000562d0
ZFS recordsize of 128K, dd if=./somelarge5gfile of=/dev/null bs=16k (to mimic application blocksizes)
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   78.2    0.0 10009.6    0.0  0.0  0.2    0.0    1.9   0  15 c6t60050768018E82BDA800000000000565d0
   78.6    0.0  9960.0    0.0  0.0  0.1    0.0    1.2   0  10 c6t60050768018E82BDA800000000000564d0
   79.4    0.0 10062.3    0.0  0.0  0.4    0.0    4.4   0  35 c6t60050768018E82BDA800000000000563d0
   76.6    0.0  9804.8    0.0  0.0  0.2    0.0    2.3   0  17 c6t60050768018E82BDA800000000000562d0
dd if=/dev/rdsk/c6t60050768018E82BDA800000000000564d0s0 of=/dev/null bs=32k (to mimic small ZFS blocksize)
                    extended device statistics
    r/s    w/s    kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 3220.9    0.0 51533.9    0.0  0.0  0.9    0.0    0.3   1  94 c6t60050768018E82BDA800000000000564d0
So, it's not like the underlying disk isn't capable of much more
than what ZFS is asking of it. I understand the part where it has to
do 4x as much work with a 32K blocksize as with 128K - and indeed, at
roughly the same 67 - 79 reads/sec per disk in both cases, throughput
simply scales with the record size (67 x 32K is about 2.1 MB/s versus
78 x 128K at about 9.8 MB/s) - but it doesn't seem as if ZFS is driving
the underlying disks very hard at all. We've asked the customer to rerun
the test without the /etc/system tunables. Has anybody else worked a
similar issue? Any hints would be greatly appreciated.
Thanks!
Jim