On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote:
> Yes. I've been looking at what the value of zfs_vdev_max_pending should be.
> The old value was 35 (a guess, but a really bad guess) and the new value is
> 10 (another guess, but a better guess).  I observe that data from a fast,
> modern HDD, for 1-10 threads (outstanding I/Os) the IOPS ranges from 309
> to 333 IOPS.  But as we add threads, the average response time increases
> from 2.3ms to 137ms.

Interesting.  What happens to total throughput, since that's the
expected tradeoff against latency here?  I might guess that in your
tests with a constant I/O size it's linear with IOPS - but I wonder if
that remains so for larger I/Os or with mixed sizes?
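
As a back-of-the-envelope check (Little's law, sketched in Python; the
8 KiB I/O size is my assumption, not something from your test):

    # Little's law: outstanding I/Os = IOPS * response time.  Once the
    # drive is pegged around ~330 IOPS, extra queue depth buys almost
    # no throughput and just adds latency.
    def predicted_latency_ms(outstanding, iops):
        return outstanding / iops * 1000.0

    def throughput_mb_s(iops, io_size_kib=8):   # 8 KiB is an assumed I/O size
        return iops * io_size_kib / 1024.0

    for depth in (1, 10, 35):
        print(depth, round(predicted_latency_ms(depth, 333), 1), "ms,",
              round(throughput_mb_s(333), 1), "MB/s")
    #  1 ->   3.0 ms, 2.6 MB/s   (close to the observed 2.3 ms)
    # 10 ->  30.0 ms, 2.6 MB/s
    # 35 -> 105.1 ms, 2.6 MB/s   (the old default queue depth)

Turned around: if the drive is still doing roughly 330 IOPS when you see
137 ms, Little's law implies something like 45 I/Os genuinely in flight,
which ties into the write-cache point below.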

> Since the whole idea is to get lower response time, and we know disks are not
> simple queues so there is no direct IOPS to response time relationship, maybe
> it is simply better to limit the number of outstanding I/Os.

I also wonder if we're seeing a form of "bufferbloat" here in these
latencies.

As I wrote in another post yesterday, remember that you're not
counting actual outstanding I/Os here, because the write I/Os are
being acknowledged immediately and tracked internally. The disk may
therefore be getting itself into a state where either the buffer/queue
is effectively full, or the number of requests it is tracking
internally becomes inefficient to manage (on top of the head-thrashing).
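
A toy illustration of what I mean, with an entirely invented mix of
operations:

    # The host only counts an I/O as outstanding until the drive acks
    # it, but a cached write is acked immediately while the media write
    # is still queued inside the drive.
    host_outstanding = 0
    drive_backlog = 0

    for op in ["write"] * 8 + ["read"] * 2:
        host_outstanding += 1
        drive_backlog += 1
        if op == "write":
            host_outstanding -= 1   # acked from the write cache straight away

    print(host_outstanding, drive_backlog)   # -> 2 10

ZFS thinks only a couple of I/Os are outstanding; the drive is juggling
far more.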

Even before you get to that state and writes start slowing down too,
your averages are skewed by the write cache. All the writes are fast,
while a longer queue exposes reads to contention with each other, as
well as to a much wider window of writes.  Can you look at the average
response time for just the reads, even amongst a mixed r/w workload?
Perhaps some statistic other than the average, too.
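
A toy example of why I distrust the combined average (again, invented
numbers rather than yours):

    import statistics

    # Invented mix: 70% writes acked from cache at ~0.2 ms, 30% reads
    # stuck behind the queued writes at ~40 ms.
    writes = [0.2] * 700
    reads  = [40.0] * 300

    combined = writes + reads
    print(statistics.mean(combined))    # ~12 ms  - looks tolerable
    print(statistics.median(combined))  #  0.2 ms - hides the reads entirely
    print(statistics.mean(reads))       # 40 ms   - what the application feels

A read-only breakdown, or a percentile/distribution, would tell you much
more here than the overall mean.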

Can you repeat the tests with write-cache disabled, so you're more
accurately exposing the controller's actual workload and backlog?

I hypothesise that this will avoid those latencies getting so
ridiculously out of control, and potentially also show better
(relative) results for higher concurrency counts.  Alternatively, it
will show that your disk firmware really is horrible at managing
concurrency even for small values :)

Whether it shows better absolute results than a shorter queue + write
cache is an entirely different question.  The write cache will
certainly make things faster in the common case, which is another way
of saying that your lower-bound average latencies are artificially low,
making the degradation look worse than it really is.

> > This comment seems to indicate that the drive queues up a whole bunch of
> > requests, and since the queue is large, each individual response time has
> > become large.  It's not that physical actual performance has degraded with
> > the cache enabled, it's that the queue has become long.  For async writes,
> > you don't really care how long the queue is, but if you have a mixture of
> > async writes and occasional sync writes...  Then the queue gets long, and
> > when you sync, the sync operation will take a long time to complete.  You
> > might actually benefit by disabling the disk cache.
> > 
> > Richard, have I gotten the gist of what you're saying?
> 
> I haven't formed an opinion yet, but I'm inclined towards wanting overall
> better latency.

And, in particular, better latency for specific (read) requests that
ZFS prioritises; these are often the ones that contribute most to a system
feeling unresponsive.  If this prioritisation is lost once passed to
the disk, both because the disk doesn't have a priority mechanism and
because it's contending with the deferred cost of previous writes,
then you'll get better latency for the requests you care most about
with a shorter queue.

--
Dan.

