Richard,
        First, thank you for the detailed reply ... (comments in line below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
<richard.ell...@gmail.com> wrote:
> more below...
>
> On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:
>
>> On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
>> <richard.ell...@gmail.com> wrote:
>>
>>> Try disabling prefetch.
>>
>> Just tried it... no change in random read (still 17-18 MB/sec for a
>> single thread), but sequential read performance dropped from about 200
>> MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
>> accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
>> arcstat.pl shows that the vast majority (>95%) of reads are missing
>> the cache.
>
> hmmm... more testing needed. The question is whether the low
> I/O rate is because of zfs itself, or the application? Disabling prefetch
> will expose the application, because zfs is not creating additional
> and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing. I did put together a crude script (roughly sketched below)
that spawns many dd processes against the block device itself, each
seeking to a different offset over the range of the disk, and saw
results much greater than the single-threaded iozone random
performance.
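
For reference, the script is roughly the following sketch; the device
path, thread count, and offsets here are placeholders, not our exact
values:

#!/bin/ksh
# Crude parallel random-read generator: spawn N dd readers against the
# raw device, each starting at a different offset.
DEV=/dev/rdsk/c1t0d0s2     # hypothetical raw device
THREADS=16                 # number of concurrent readers
i=0
while [ $i -lt $THREADS ]; do
    # skip is in bs-sized (256 KB) blocks, so each reader starts
    # further into the device than the previous one
    dd if=$DEV of=/dev/null bs=256k skip=$((i * 4000)) count=1000 &
    i=$((i + 1))
done
wait    # let all readers finish before looking at iostat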

> Your data which shows the sequential write, random write, and
> sequential read driving actv to 35 is because prefetching is enabled
> for the read.  We expect the writes to drive to 35 with a sustained
> write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that actv
went to 50 (with very little difference in performance), so I returned
it to the default of 35.
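
For anyone following along, these are the /etc/system tunables I have
been toggling for these tests (Solaris 10-era names; /etc/system needs
a reboot, or the same variables can be poked live with mdb -kw). Which
knob corresponds to the actv=35 queue depth is my assumption:

* disable file-level prefetch for the duration of the test
set zfs:zfs_prefetch_disable = 1
* cap the ARC at 1 GB for testing
set zfs:zfs_arc_max = 0x40000000
* per-vdev queue depth; default is 35, tried 50 and put it back
set zfs:zfs_vdev_max_pending = 50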

> The random read (with cache misses)
> will stall the application, so it takes a lot of threads (>>16?) to keep
> 35 concurrent I/Os in the pipeline without prefetching.  The ZFS
> prefetching algorithm is "intelligent" so it actually complicates the
> interpretation of the data.

What bothers me is that iostat is showing the 'disk' device as not
being saturated during the random read test. I'll post the iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ so
you can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).

> You're peaking at 658 256KB random IOPS for the 3511, or ~66
> IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
> see something more than 66 IOPS each.  The IOPS data from
> iostat would be a better metric to observe than bandwidth.  These
> drives are good for about 80 random IOPS each, so you may be
> close to disk saturation.  The iostat data for IOPS and svc_t will
> confirm.

But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads?
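
For completeness, this is how I am watching per-LUN IOPS and service
times while the tests run (the interval is arbitrary):

# extended per-device stats every 5 seconds, skipping idle devices;
# r/s is read IOPS, actv is queued+active I/Os, asvc_t is average
# service time in milliseconds
iostat -xnz 5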

> The T2000 data (sheet 3) shows pretty consistently around
> 90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
> less than I would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).

> Also, the 3511 RAID-5 configuration will perform random reads at
> around 1/2 IOPS capacity if the partition offset is 34.  This was the
> default long ago.  The new default is 256.

Our 3511s have been running 421F (the latest) for a long time :-) We
are religious about keeping the firmware on all of the 3511s current
and matched.

> The reason is that with
> a 34 block offset, you are almost guaranteed that a larger I/O will
> stride 2 disks.  You won't notice this as easily with a single thread,
> but it will be measurable with more threads. Double check the
> offset with prtvtoc or format.

How do I check the offset ... format -> verify output from one of the
partitions is below:

format> ver

Volume name = <        >
ascii name  = <SUN-StorEdge 3511-421F-517.23GB>
bytes/sector    =  512
sectors = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector          Size          Last Sector
  0        usr    wm               256       517.22GB           1084694494
  1 unassigned    wm                 0            0                0
  2 unassigned    wm                 0            0                0
  3 unassigned    wm                 0            0                0
  4 unassigned    wm                 0            0                0
  5 unassigned    wm                 0            0                0
  6 unassigned    wm                 0            0                0
  8   reserved    wm        1084694495         8.00MB           1084710878

format>
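
Slice 0 starts at sector 256, so it looks like we already have the
newer 256-sector offset rather than 34. For reference, prtvtoc shows
the same thing (the device name below is just a placeholder, not our
actual LUN):

# print the label; the "First Sector" column for slice 0 should show 256
prtvtoc /dev/rdsk/c1t0d0s2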

> Writes are a completely different matter.  ZFS has a tendency to
> turn random writes into sequential writes, so it is pretty much
> useless to look at random write data. The sequential writes
> should easily blow through the cache on the 3511.

I am seeing cache utilization of 25-30% during write tests, with
occasional peaks close to 50%, which is expected as I am testing
against one partition on one logical drive.

>  Squinting
> my eyes, I would expect the array can do around 70 MB/s
> writes, or 25 256KB IOPS saturated writes.

iostat and the 3511 transfer rate monitor are showing peaks of 150-180
MB/sec with sustained throughput of 100 MB/sec.

>  By contrast, the
> T2000 JBOD data shows consistent IOPS at the disk level
> and exposes the track cache effect on the sequential read test.

Yup, it is clear that we are easily hitting the read I/O limits of the
drives in the T2000.

> Did I mention that I'm a member of BAARF?  www.baarf.com :-)

Not yet :-)

> Hint: for performance work with HDDs, pay close attention to
> IOPS, then convert to bandwidth for the PHB.

PHB ???

I do look at IOPS, but what struck me as odd was the disparate results.
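
Doing that conversion with the numbers above, back of the envelope: at
roughly 80 random IOPS per drive and at most 128 KB per ZFS I/O, each
spindle tops out around 80 x 128 KB, or about 10 MB/sec, of random
read bandwidth.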

<snip>

> b119 has improved stat() performance, which should make a positive
> improvement of such backups.  But eventually you may need to move
> to a multi-stage backup, depending on your business requirements.

Due to contract issues (I am consulting at a government agency), we
cannot yet run OpenSolaris in production.

On our previous server for this application (Apple G5) we had 4 TB of
data and about 50 million files (under HFS+) and a full backup took 3
WEEKS. We went the route of explicitly specifying each directory in
the NetBackup config and got _some_ reliability. Today we have about
22 TB in over 200 ZFS datasets (not evenly distributed,
unfortunately), the largest of which is about 3.5 TB and 30 million
files.

BTW, our overall configuration is based on h/w we bought years ago and
are having to adapt as best we can. We are pushing to replace the
SE-3511 arrays with J4400 JBODs. Our current config has 11-disk RAID-5
sets and 1 hot spare per 3511 tray; we carve those into 'standard'
512 GB partitions, which we mirror at the zpool layer across 3511
arrays. We just add additional mirror pairs as the data in each
department grows, keeping the two sides of each mirror on different
arrays :-)
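
As a sketch of how the pools are grown (the pool and device names here
are made up; the real devices are the 3511 LUNs), each vdev is a mirror
with one side on each array:

# initial pool: one mirrored pair, one 512 GB partition from each 3511
zpool create deptpool mirror c2t40d0 c3t40d0
# as a department's data grows, add another mirrored pair, again
# keeping the two sides on different arrays
zpool add deptpool mirror c2t41d1 c3t41d1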

More testing results in a separate email, this one is already too long.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, Lunacon 2010 (http://www.lunacon.org/)
-> Technical Advisor, RPI Players
