Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Paul Kraus
Richard,
First, thank you for the detailed reply ... (comments in line below)

On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
richard.ell...@gmail.com wrote:
 more below...

 On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:

 On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
 richard.ell...@gmail.com wrote:

 Try disabling prefetch.

 Just tried it... no change in random read (still 17-18 MB/sec for a
 single thread), but sequential read performance dropped from about 200
 MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
 accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
 arcstat.pl shows that the vast majority (95%) of reads are missing
 the cache.

 hmmm... more testing needed. The question is whether the low
 I/O rate is because of zfs itself, or the application? Disabling prefetch
 will expose the application, because zfs is not creating additional
 and perhaps unnecessary read I/O.

The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing. I did put together a crude script to spawn many dd processes
accessing the block device itself, each with a different seek offset
over the range of the disk, and saw results much greater than the
iozone single-threaded random performance.
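
For what it's worth, the crude harness amounted to something like the
sketch below (the device path, offsets, and counts are placeholders,
not the exact values I used):

  #!/bin/ksh
  # spawn NPROC dd readers against the raw LUN, each starting at a
  # different offset so the reads land all over the device
  DEV=/dev/rdsk/c2t0d0s2      # placeholder device path
  NPROC=16
  i=0
  while [ $i -lt $NPROC ]; do
      # iseek is in units of bs; step the readers ~25 GB apart,
      # each reading 1 GB in 256 KB records
      dd if=$DEV of=/dev/null bs=256k iseek=$((i * 100000)) count=4000 &
      i=$((i + 1))
  done
  wait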

 Your data which shows the sequential write, random write, and
 sequential read driving actv to 35 is because prefetching is enabled
 for the read.  We expect the writes to drive to 35 with a sustained
 write workload of any flavor.

Understood. I tried tuning the queue size to 50 and observed that the
actv went to 50 (with very little difference in performance), so
returned it to the default of 35.
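
(For the archives: assuming the knob in question is the ZFS per-vdev
queue limit, it can be inspected and changed roughly as below; the
/etc/system entry needs a reboot, the mdb write is immediate but does
not survive one.)

  # current per-vdev queue depth limit (35 is the default here)
  echo zfs_vdev_max_pending/D | mdb -k

  # change it live (lost at reboot) ...
  echo zfs_vdev_max_pending/W0t35 | mdb -kw

  # ... or persistently in /etc/system
  set zfs:zfs_vdev_max_pending = 35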

 The random read (with cache misses)
 will stall the application, so it takes a lot of threads (16?) to keep
 35 concurrent I/Os in the pipeline without prefetching.  The ZFS
 prefetching algorithm is intelligent so it actually complicates the
 interpretation of the data.

What bothers me is that iostat is showing the 'disk' device as
not being saturated during the random read test. I'll post iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You
can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).

 You're peaking at 658 256KB random IOPS for the 3511, or ~66
 IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
 see something more than 66 IOPS each.  The IOPS data from
 iostat would be a better metric to observe than bandwidth.  These
 drives are good for about 80 random IOPS each, so you may be
 close to disk saturation.  The iostat data for IOPS and svc_t will
 confirm.

But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads ?

 The T2000 data (sheet 3) shows pretty consistently around
 90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
 less than I would expect, perhaps due to the measurement.

I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).

 Also, the 3511 RAID-5 configuration will perform random reads at
 around 1/2 IOPS capacity if the partition offset is 34.  This was the
 default long ago.  The new default is 256.

Our 3511's have been running 421F (latest) for a long time :-) We are
religious about keeping all the 3511 FW current and matched.

 The reason is that with
 a 34 block offset, you are almost guaranteed that a larger I/O will
 stride 2 disks.  You won't notice this as easily with a single thread,
 but it will be measurable with more threads. Double check the
 offset with prtvtoc or format.

How do I check the offset ... format -> verify output from one of the partitions is below:

format> ver

Volume name = 
ascii name  = SUN-StorEdge 3511-421F-517.23GB
bytes/sector=  512
sectors = 1084710911
accessible sectors = 1084710878
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm               256    517.22GB         1084694494
  1 unassigned    wm                 0           0                  0
  2 unassigned    wm                 0           0                  0
  3 unassigned    wm                 0           0                  0
  4 unassigned    wm                 0           0                  0
  5 unassigned    wm                 0           0                  0
  6 unassigned    wm                 0           0                  0
  8   reserved    wm        1084694495      8.00MB         1084710878

format>
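
(prtvtoc shows the same starting-sector information without walking the
format menus; the device path below is illustrative.)

  prtvtoc /dev/rdsk/c2t0d0s2
  # the "First Sector" column for slice 0 is the offset in question:
  # 256 here (the newer default) rather than 34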

 Writes are a completely different matter.  ZFS has a tendency to
 turn random writes into sequential writes, so it is pretty much
 useless to look at random write data.

Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Paul Kraus
I posted baseline stats at http://www.ilk.org/~ppk/Geek/

baseline test was 1 thread, 3 GiB file, 64KiB to 512 KiB record size

480-3511-baseline.xls is an iozone output file

iostat-baseline.txt is the iostat output for the device in use (annotated)

I also noted an odd behavior yesterday and have not had a chance to
better qualify it. I was testing various combinations of vdev
quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all
backed by partitions on the same logical disk on the 3511) there was
no real change in sequential write, random write, or random read
performance. Sequential read performance did show a drop from 216
MiB/sec. at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as
expected.

As I changed the number of mirror components things got interesting.
Keep in mind that I only have one 3511 for testing right now, I had to
use partitions from two other production 3511's to get three mirror
components on different arrays. As expected, as I went from 1 to 2 to
3 mirror components the write performance did not change, but the read
performance was interesting... see below:

read performance
mirrors  sequential  random
1  174 MiB/sec.  23 MiB/sec.
2  229 MiB/sec.  30 MiB/sec.
3  223 MiB/sec.  125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors saw a large
increase in sequential read performance, and from 2 to 3 mirrors showed
a HUGE increase in random read performance. It feels like the behavior
of the zfs code changed between 2 and 3 mirrors for the random read data.
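
(For anyone reproducing this: a layout like the one above can be grown
one component at a time with zpool attach; the device names below are
placeholders.)

  # two-way mirror across two arrays
  zpool create testpool mirror c2t0d0 c3t0d0
  # add a third component (third array) to the same top-level mirror
  zpool attach testpool c2t0d0 c4t0d0
  zpool status testpool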

Now to investigate further, I tried multiple mirror components on the
same array (my test 3511), not that you would do this in production,
but I was curious what would happen. In this case the throughput
degraded across the board as I added mirror components, as one would
expect. In the random read case the array was delivering less overall
performance than it was when it was one part of the earlier test (16
MiB/sec. combined vs. 1/3 of 125 MiB/sec.) See sheet 7 of
http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
results. Sheet 8 is the last test I did last night, using the NRAID
logical disk type to try to get the 3511 to pass a disk through to
zfs, but get the advantage of the cache on the 3511. I'm not sure what
to read into those numbers.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, Lunacon 2010 (http://www.lunacon.org/)
- Technical Advisor, RPI Players


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread William D. Hathaway
If you are using (3) 3511's, then won't it be possible that your 3 GB workload
will be largely or entirely served out of RAID controller cache?

Also, a question about your production backups (millions of small files):
do you have atime=off set for the filesystems?  That might be helpful.
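
For example (dataset name is just a placeholder):

  zfs get atime tank/data        # check the current setting
  zfs set atime=off tank/data    # avoid an access-time update on every read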


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Mike Gerdts
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus pk1...@gmail.com wrote:
 You're peaking at 658 256KB random IOPS for the 3511, or ~66
 IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
 see something more than 66 IOPS each.  The IOPS data from
 iostat would be a better metric to observe than bandwidth.  These
 drives are good for about 80 random IOPS each, so you may be
 close to disk saturation.  The iostat data for IOPS and svc_t will
 confirm.

 But ... if I am saturating the 3511 with one thread, then why do I get
 many times that performance with multiple threads ?

I'm having trouble making sense of the iostat data (I can't tell how
many threads at any given point), but I do see lots of times where
asvc_t * reads is in the range 850 ms to 950 ms.  That is, this is as
fast as a single threaded app with a little bit of think time can
issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms think_time = 1
sec).  The %busy shows that 90+% of the time there is an I/O in flight
(100 reads * 9ms = 900/1000 = 90%).  However, %busy isn't aware of how
many I/O's could be in flight simultaneously.

When you fire up more threads, you are able to have more I/O's in
flight concurrently.  I don't believe that the I/O's per drive is
really a limiting factor in the single-threaded case, as the spec
sheet for the 3511 says that it has 1 GB of cache per controller.
Your working set is small enough that it is somewhat likely that many
of those random reads will be served from cache.  A dtrace analysis of
just how random the reads are would be interesting.  I think that
hotspot.d from the DTrace Toolkit would be a good starting place.
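
A rough io-provider one-liner along these lines (the dev_statname is a
placeholder for whichever LUN you are watching) would histogram the
block offsets and show how scattered the reads really are:

  dtrace -n 'io:::start /args[1]->dev_statname == "ssd3"/ { @offsets = quantize(args[0]->b_blkno); }'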

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Richard Elling

more below...

On Nov 25, 2009, at 5:54 AM, Paul Kraus wrote:


Richard,
   First, thank you for the detailed reply ... (comments in line below)


On Tue, Nov 24, 2009 at 6:31 PM, Richard Elling
richard.ell...@gmail.com wrote:

more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:


On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
richard.ell...@gmail.com wrote:


Try disabling prefetch.


Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about 200
MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (95%) of reads are missing
the cache.


hmmm... more testing needed. The question is whether the low
I/O rate is because of zfs itself, or the application? Disabling prefetch
will expose the application, because zfs is not creating additional
and perhaps unnecessary read I/O.


The values reported by iozone are in pretty close agreement with what
we are seeing with iostat during the test runs. Compression is off on
zfs (the iozone test data compresses very well and yields bogus
results). I am looking for a good alternative to iozone for random
testing. I did put together a crude script to spawn many dd processes
accessing the block device itself, each with a different seek offset
over the range of the disk, and saw results much greater than the
iozone single-threaded random performance.


filebench is usually bundled in /usr/benchmarks or as a pkg.
vdbench is easy to use and very portable, www.vdbench.org
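
A minimal vdbench parameter file for the random-read case looks roughly
like this (LUN path, run length, and thread counts are placeholders):

  # random-read.parm
  sd=sd1,lun=/dev/rdsk/c2t0d0s0
  wd=wd1,sd=sd1,xfersize=128k,rdpct=100,seekpct=100
  rd=run1,wd=wd1,iorate=max,elapsed=120,interval=5,forthreads=(1,2,4,8,16)

  # run it with:  ./vdbench -f random-read.parm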


Your data which shows the sequential write, random write, and
sequential read driving actv to 35 is because prefetching is enabled
for the read.  We expect the writes to drive to 35 with a sustained
write workload of any flavor.


Understood. I tried tuning the queue size to 50 and observed that the
actv went to 50 (with very little difference in performance), so
returned it to the default of 35.


Yep, bottleneck is on the back end (physical HDDs).  For arrays with lots
of HDDs, this queue can be deeper, but the 3500 series is way too
small to see this.  If SSDs are used on the back end, then you can
revisit this.

From the data, it does look like the random read tests are converging
on the media capabilities of the disks in the array.  For the array you
can see the read-modify-write penalty of RAID-5 as well as the
caching and prefetching of reads.

Note: the physical I/Os are 128 KB, regardless of the iozone size
setting.  This is expected, since 128 KB is the default recordsize
limit for ZFS.


The random read (with cache misses)
will stall the application, so it takes a lot of threads (16?) to keep
35 concurrent I/Os in the pipeline without prefetching.  The ZFS
prefetching algorithm is intelligent so it actually complicates the
interpretation of the data.


What bothers me is that iostat is showing the 'disk' device as
not being saturated during the random read test. I'll post iostat
output that I captured yesterday to http://www.ilk.org/~ppk/Geek/ You
can clearly see the various test phases (sequential write, rewrite,
sequential read, reread, random read, then random write).


Is this a single thread?  Usually this means that you aren't creating
enough load. ZFS won't be prefetching (as much) for a random
read workload, so iostat will expose client bottlenecks.


You're peaking at 658 256KB random IOPS for the 3511, or ~66
IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
see something more than 66 IOPS each.  The IOPS data from
iostat would be a better metric to observe than bandwidth.  These
drives are good for about 80 random IOPS each, so you may be
close to disk saturation.  The iostat data for IOPS and svc_t will
confirm.


But ... if I am saturating the 3511 with one thread, then why do I get
many times that performance with multiple threads ?


The T2000 data (sheet 3) shows pretty consistently around
90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
less than I would expect, perhaps due to the measurement.


I ran the T2000 test to see if 10U8 behaved better and to make sure I
wasn't seeing an oddity of the 480 / 3511 case. I wanted to see if the
random read behavior was similar, and it was (in relative terms).


Also, the 3511 RAID-5 configuration will perform random reads at
around 1/2 IOPS capacity if the partition offset is 34.  This was the
default long ago.  The new default is 256.


Our 3511's have been running 421F (latest) for a long time :-) We are
religious about keeping all the 3511 FW current and matched.



The reason is that with
a 34 block offset, you are almost guaranteed that a larger I/O will
stride 2 disks.  You won't notice this as easily with a single thread,
but it will be measurable with more threads. Double check the
offset with prtvtoc or format.


How do I check offset ... format - verify from one of the  

Re: [zfs-discuss] ZFS Random Read Performance

2009-11-25 Thread Richard Elling

more below...

On Nov 25, 2009, at 7:10 AM, Paul Kraus wrote:


I posted baseline stats at http://www.ilk.org/~ppk/Geek/

baseline test was 1 thread, 3 GiB file, 64KiB to 512 KiB record size

480-3511-baseline.xls is an iozone output file

iostat-baseline.txt is the iostat output for the device in use (annotated)


I also noted an odd behavior yesterday and have not had a chance to
better qualify it. I was testing various combinations of vdev
quantities and mirror quantities.

As I changed the number of vdevs (stripes) from 1 through 8 (all
backed by partitions on the same logical disk on the 3511) there was
no real change in sequential write, random write, or random read
performance. Sequential read performance did show a drop from 216
MiB/sec. at 1 vdev to 180 MiB/sec. at 8 vdevs. This was about as
expected.

As I changed the number of mirror components things got interesting.
Keep in mind that I only have one 3511 for testing right now, I had to
use partitions from two other production 3511's to get three mirror
components on different arrays. As expected, as I went from 1 to 2 to
3 mirror components the write performance did not change, but the read
performance was interesting... see below:

read performance
mirrors  sequential  random
1  174 MiB/sec.  23 MiB/sec.
2  229 MiB/sec.  30 MiB/sec.
3  223 MiB/sec.  125 MiB/sec.

What the heck happened here? Going from 1 to 2 mirrors saw a large
increase in sequential read performance, and from 2 to 3 mirrors showed
a HUGE increase in random read performance. It feels like the behavior
of the zfs code changed between 2 and 3 mirrors for the random read data.


I can't explain this.  It may require a detailed understanding of the
hardware configuration to identify the potential bottleneck.

The ZFS mirroring code doesn't care how many mirrors there are, it
just goes through the list.  If the performance is not symmetrical from
all sides of the mirror, then YMMV.


Now to investigate further, I tried multiple mirror components on the
same array (my test 3511), not that you would do this in production,
but I was curious what would happen. In this case the throughput
degraded across the board as I added mirror components, as one would
expect. In the random read case the array was delivering less overall
performance than it was when it was one part of the earlier test (16
MiB/sec. combined vs. 1/3 of 125 MiB/sec.) See sheet 7 of
http://www.ilk.org/~ppk/Geek/throughput-summary.ods for these test
results. Sheet 8 is the last test I did last night, using the NRAID
logical disk type to try to get the 3511 to pass a disk through to
zfs, but get the advantage of the cache on the 3511. I'm not sure what
to read into those numbers.


I read it as the single array, as configured, with 10+1 RAID-5 can deliver
around 130 random read IOPS @ 128 KB.
 -- richard



Re: [zfs-discuss] ZFS Random Read Performance

2009-11-24 Thread Richard Elling

Try disabling prefetch.
 -- richard

On Nov 24, 2009, at 6:45 AM, Paul Kraus wrote:


   I know there have been a bunch of discussions of various ZFS
performance issues, but I did not see anything specifically on this.
In testing a new configuration of an SE-3511 (SATA) array, I ran into
an interesting ZFS performance issue. I do not believe that this is
creating a major issue for our end users (but it may), but it is
certainly impacting our nightly backups. I am only seeing 10-20 MB/sec
per thread for random read throughput using iozone for testing. Here
is the full config:

SF-V480
--- 4 x 1.2 GHz III+
--- 16 GB memory
--- Solaris 10U6 with ZFS patch and IDR for snapshot / resilver bug.
SE-3511
--- 12 x 500 GB SATA drives
--- 11 disk R5
--- dual 2 Gbps FC host connection

I have the ARC size limited to 1 GB so that I can test with a rational
data set size. The total amount of data that I am testing with is 3 GB
and a 256KB record size. I tested with 1 through 20 threads.
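
(The runs used iozone along the lines of the command below; the flags
and paths shown are illustrative rather than the exact command line,
and the multi-thread runs add -t with one file per thread via -F.)

  # write/rewrite, read/reread, random read/write; 3 GB file, 256 KB records
  iozone -R -b results.xls -i 0 -i 1 -i 2 -r 256k -s 3g -f /testpool/fs/iozone.tmp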

With 1 thread I got the following results:
sequential write: 112 MB/sec.
sequential read: 221 MB/sec.
random write: 96 MB/sec.
random read: 18 MB/sec.

As I scaled the number of threads (and kept the total data size the
same) I got the following (throughput is in MB/sec):
threads   sw    sr   rw    rr
   2     105   218   93    34
   4     106   219   88    52
   8      95   189   69    92
  16      71   153   76   128

As the number of threads climbs the first three values drop once you
get above 4 threads (one per CPU), but the fourth (random read) climbs
well past 4 threads. It is just about linear through 9 threads and
then it starts fluctuating, but continues climbing to at least 20
threads (I did not test past 20). Above 16 threads the random read
even exceeds the sequential read values.

Looking at iostat output for the LUN I am using for the 1 thread case,
for the first three tests (sequential write, sequential read, random
write) I see %b at 100 and actv climb to 35 and hang out there. For
the random read test I see %b at 5 to 7, actv at less than 1 (usually
around 0.5 to 0.6), wsvc_t is essentially 0, and asvc_t runs about 14.
As the number of threads increases, the iostat values don't really
change for the first three tests (sequential write/read and random write),
but they climb for the random read. The array is close to saturated at about
170 MB/sec. random read (18 threads), so I know that the 18 MB/sec.
value for one thread is _not_ limited by the array.
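
(For reference, the extended iostat form that shows these columns is
something like the following; the 5-second interval is just an example.)

  iostat -xnz 5      # extended stats, descriptive names, skip idle devices
  # r/s and w/s give IOPS, actv the queue depth, wsvc_t/asvc_t the wait
  # and active service times, %b the busy percentage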

I know the 3511 is not a high performance array, but we needed lots of
bulk storage and could not afford better when we bought these 3 years
ago. But, it seems to me that there is something wrong with the random
read performance of ZFS. To test whether this is an effect of the 3511
I ran some tests on another system we have, as follows:

T2000
--- 32 thread 1 GHz
--- 32 GB memory
--- Solaris 10U8
--- 4 Internal 72 GB SAS drives

We have a zpool built of one slice on each of the 4 internal drives
configured as a striped mirror layout (2 vdevs each of 2 slices). So
I/O is spread over all 4 spindles. I started with 4 threads and 8 GB
each (32 GB total to ensure I got past the ARC, it is not tuned down
on this system). I saw exactly the same ratio of sequential read to
random read (the random read performance was 23% of the sequential
read performance in both cases). Based on looking at iostat values
during the test, I am saturating all four drives with the write
operations with just 1 thread. The sequential read is saturating the
drives with anything more than 1 thread, and the random read is not
saturating the drives until I get to about 6 threads.

threads  sw  sr  rw  rr
1  100  207  88  30
2  103  370  88  53
4  98  350  90  82
8  101  434  92  95
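
(A pool with that layout, two mirrored pairs striped together, is built
along these lines; the slice names are placeholders.)

  zpool create testpool mirror c1t0d0s4 c1t1d0s4 mirror c1t2d0s4 c1t3d0s4
  zpool status testpool   # shows two top-level mirror vdevs, four slices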

I confirmed that the problem is not unique to either 10U6 or the IDR,
10U8 has the same behavior.

I confirmed that the problem is not unique to a FC attached disk array
or the SE-3511 in particular.

Then I went back and took another look at my original data
(SF-V480/SE-3511) and looked at throughput per thread. For the
sequential operations and the random write, the throughput per thread
fell pretty far and pretty fast, but the per thread random read
numbers fell very slowly.

Per thread throughput in MB/sec.
threads  sw  sr  rw  rr
1  112  221  96  18
2  53  109  46  17
4  26  55  22  13
8  12  24  9  12
16  5  10  5  8

So this makes me think that the random read performance issue is a
limitation per thread. Does anyone have any idea why ZFS is not
reading as fast as the underlying storage can handle in the case of
random reads ? Or am I seeing an artifact of iozone itself ? Is there
another benchmark I should be using ?

P.S. I posted an OpenOffice.org spreadsheet of my test results here:
http://www.ilk.org/~ppk/Geek/throughput-summary.ods

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (

Re: [zfs-discuss] ZFS Random Read Performance

2009-11-24 Thread Paul Kraus
On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
richard.ell...@gmail.com wrote:

 Try disabling prefetch.

Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about 200
MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (95%) of reads are missing
the cache.
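
(For the archives, the usual way to apply both settings on Solaris 10
is roughly the following; the /etc/system entries need a reboot, while
the mdb write takes effect immediately but does not persist.)

  # /etc/system entries (take effect at the next boot)
  set zfs:zfs_prefetch_disable = 1
  set zfs:zfs_arc_max = 0x40000000       # cap the ARC at 1 GB

  # or toggle prefetch on a running system (not persistent):
  echo zfs_prefetch_disable/W0t1 | mdb -kw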

The reason I don't think that this is hitting our end users is the
cache hit ratio (reported by arc_summary.pl) is 95% on the production
system (I am working on our test system and am the only one using it
right now, so all the I/O load is iozone).

I think my next step (beyond more poking with DTrace) is to try a
backup and see what I get for ARC hit ratio ... I expect it to be low,
but I may be surprised (then I have to figure out why backups are as
slow as they are). We are using NetBackup and it takes about 3 days to
do a FULL on a 3.3 TB zfs with about 30 million files. Differential
Incrementals take 16-22 hours (and almost no data changes). The
production server is an M4000, 4 dual core CPUs, 16 GB memory, and
about 25 TB of data overall. A big SAMBA file server.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
- Technical Advisor, Lunacon 2010 (http://www.lunacon.org/)
- Technical Advisor, RPI Players


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-24 Thread Bob Friesenhahn

On Tue, 24 Nov 2009, Paul Kraus wrote:


On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
richard.ell...@gmail.com wrote:


Try disabling prefetch.


Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about 200
MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (95%) of reads are missing
the cache.


You will often see the best random access performance if you access 
the data using the same record size that zfs uses.  For example, if 
you request data in 256KB records, but zfs is using 128KB records, 
then zfs needs to access, reconstruct, and concatenate two 128K zfs 
records before it can return any data to the user.  This increases the 
access latency and decreases opportunity to take advantage of 
concurrency.
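
For example, checking the dataset and then matching the benchmark
record size to it (dataset and path are placeholders):

  zfs get recordsize testpool/fs     # 128K unless it has been changed
  iozone -i 0 -i 2 -r 128k -s 3g -f /testpool/fs/iozone.tmp   # random I/O at 128 KB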


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS Random Read Performance

2009-11-24 Thread Richard Elling

more below...

On Nov 24, 2009, at 9:29 AM, Paul Kraus wrote:


On Tue, Nov 24, 2009 at 11:03 AM, Richard Elling
richard.ell...@gmail.com wrote:


Try disabling prefetch.


Just tried it... no change in random read (still 17-18 MB/sec for a
single thread), but sequential read performance dropped from about 200
MB/sec. to 100 MB/sec. (as expected). Test case is a 3 GB file
accessed in 256 KB records. ARC is set to a max of 1 GB for testing.
arcstat.pl shows that the vast majority (95%) of reads are missing
the cache.


hmmm... more testing needed. The question is whether the low
I/O rate is because of zfs itself, or the application? Disabling prefetch
will expose the application, because zfs is not creating additional
and perhaps unnecessary read I/O.

Your data which shows the sequential write, random write, and
sequential read driving actv to 35 is because prefetching is enabled
for the read.  We expect the writes to drive to 35 with a sustained
write workload of any flavor. The random read (with cache misses)
will stall the application, so it takes a lot of threads (16?) to keep
35 concurrent I/Os in the pipeline without prefetching.  The ZFS
prefetching algorithm is intelligent so it actually complicates the
interpretation of the data.

You're peaking at 658 256KB random IOPS for the 3511, or ~66
IOPS per drive.  Since ZFS will max out at 128KB per I/O, the disks
see something more than 66 IOPS each.  The IOPS data from
iostat would be a better metric to observe than bandwidth.  These
drives are good for about 80 random IOPS each, so you may be
close to disk saturation.  The iostat data for IOPS and svc_t will
confirm.

The T2000 data (sheet 3) shows pretty consistently around
90 256KB IOPS per drive. Like the 3511 case, this is perhaps 20%
less than I would expect, perhaps due to the measurement.

Also, the 3511 RAID-5 configuration will perform random reads at
around 1/2 IOPS capacity if the partition offset is 34.  This was the
default long ago.  The new default is 256. The reason is that with
a 34 block offset, you are almost guaranteed that a larger I/O will
stride 2 disks.  You won't notice this as easily with a single thread,
but it will be measurable with more threads. Double check the
offset with prtvtoc or format.

Writes are a completely different matter.  ZFS has a tendency to
turn random writes into sequential writes, so it is pretty much
useless to look at random write data. The sequential writes
should easily blow through the cache on the 3511.  Squinting
my eyes, I would expect the array can do around 70 MB/s
writes, or 25 256KB IOPS saturated writes.  By contrast, the
T2000 JBOD data shows consistent IOPS at the disk level
and exposes the track cache effect on the sequential read test.

Did I mention that I'm a member of BAARF?  www.baarf.com :-)

Hint: for performance work with HDDs, pay close attention to
IOPS, then convert to bandwidth for the PHB.
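
As a rough worked example with the numbers above: 11 spindles at ~80
random IOPS each is on the order of 880 IOPS for the LUN, and at 128 KB
per physical I/O that is only about 110 MB/s of random-read bandwidth
best case, before any help from the controller cache.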


The reason I don't think that this is hitting our end users is the
cache hit ratio (reported by arc_summary.pl) is 95% on the production
system (I am working on our test system and am the only one using it
right now, so all the I/O load is iozone).

I think my next step (beyond more poking with DTrace) is to try a
backup and see what I get for ARC hit ratio ... I expect it to be low,
but I may be surprised (then I have to figure out why backups are as
slow as they are). We are using NetBackup and it takes about 3 days to
do a FULL on a 3.3 TB zfs with about 30 million files. Differential
Incrementals take 16-22 hours (and almost no data changes). The
production server is an M4000, 4 dual core CPUs, 16 GB memory, and
about 25 TB of data overall. A big SAMBA file server.


b119 has improved stat() performance, which should be a positive
improvement for such backups.  But eventually you may need to move
to a multi-stage backup, depending on your business requirements.
 -- richard
