Hi Jason,
It seems to me that a full analysis would need some more detailed
information.  So, to keep the ball rolling, I'll respond generally.

Jason J. W. Williams wrote:
Hi Richard,

Been watching the stats on the array, and the cache hits are < 3% on
these volumes. We're very write-heavy, and rarely write similar enough
data twice. With random-oriented database data and sequential-oriented
database log data on the same volume groups, it seems to me this was
causing a lot of head repositioning.

In general, writes are buffered.  For many database workloads, the
sequential log writes won't be write-cache hits and will be coalesced.
There are several ways you could account for this, but suffice it to
say that the read cache hit rate is more interesting for performance
improvement opportunities.  The random reads are often cache misses,
and adding prefetch is often a waste of resources -- the nature of the
beast.

For ZFS, all data writes should be sequential until you get near the
capacity of the volume, at which point there will be a search for free
blocks, which may be randomly dispersed.  One way to look at this is
that for new and not-yet-filled volumes, ZFS will write sequentially,
unlike other file systems.  Once the volume fills up, ZFS will write
more like other file systems.  Hence, your write performance with ZFS
may change over time, though this will be somewhat mitigated by the
RAID array's write buffer cache.
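To make that concrete, here is a toy model (my own illustration in
Python, not ZFS code): it compares the block-to-block distance of
copy-on-write allocations on a freshly created pool, where free space
is one contiguous run, against a nearly full pool whose remaining free
blocks are scattered.

# Toy model of copy-on-write allocation (illustration only, not ZFS code).
# On a fresh pool, free space is one contiguous run, so successive writes
# land on adjacent blocks.  On a nearly full pool, the free blocks left
# behind by earlier frees are scattered, so successive writes jump around.
import random

POOL_BLOCKS = 10000

def avg_gap(free_blocks, writes):
    """Average distance between consecutively allocated blocks."""
    free = sorted(free_blocks)[:writes]
    gaps = [b - a for a, b in zip(free, free[1:])]
    return sum(gaps) / len(gaps)

fresh = range(POOL_BLOCKS)                                    # 100% free, contiguous
aged = random.sample(range(POOL_BLOCKS), POOL_BLOCKS // 20)   # 5% free, scattered

print("fresh pool, avg gap between writes:", avg_gap(fresh, 500))  # ~1 block
print("aged pool,  avg gap between writes:", avg_gap(aged, 500))   # ~20 blocks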

By shutting down the slave database servers we cut the latency
tremendously, which would seem to me to indicate a lot of contention.
But I'm trying to come up to speed on this, so I may be wrong.

This is likely.

Note that RAID controllers are really just servers which speak a
block-level protocol to other hosts.  Some RAID controllers are
underpowered.

ZFS on a modern server can create a significant workload.  This
can also clobber a RAID array.  For example, by default, ZFS will
queue up to 35 I/Os per vdev before blocking.  If you have one
RAID array which is connected to 4 hosts, each host having 5 vdevs,
then the RAID array would need to be able to handle 700 (35 * 4 * 5)
concurrent I/Os.  There are RAID arrays, which will remain nameless,
that will not handle that workload very well.  Under lab conditions
you should be able to empirically determine the knee in the response
time curve as you add workload.
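As a back-of-the-envelope check, the arithmetic above plus a crude
knee finder might look like the sketch below (Python; the latency
samples are hypothetical lab numbers, not measurements from any
particular array):

# Fan-in arithmetic from the example above, plus a crude knee finder.
VQ_MAX_PENDING = 35      # default per-vdev queue depth mentioned above
HOSTS = 4
VDEVS_PER_HOST = 5

outstanding = VQ_MAX_PENDING * HOSTS * VDEVS_PER_HOST
print("array must absorb up to", outstanding, "concurrent I/Os")    # 700

# Hypothetical lab data: (offered load in IOPS, average latency in ms).
samples = [(100, 5), (200, 6), (300, 8), (400, 12), (500, 45), (600, 180)]

def knee(points):
    """Return the load at the start of the segment where the latency
    slope grows the most relative to the previous segment."""
    slopes = [(y1 - y0) / (x1 - x0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    best_x, best_ratio = points[0][0], 0.0
    for i in range(1, len(slopes)):
        ratio = slopes[i] / slopes[i - 1] if slopes[i - 1] else float("inf")
        if ratio > best_ratio:
            best_ratio, best_x = ratio, points[i][0]
    return best_x

print("knee in the response-time curve is around", knee(samples), "IOPS")  # ~400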

To compound the problem, fibre channel has pitiful flow control.
Thus it may also be necessary to throttle the concurrent I/Os at the
source.  I'm not sure what the current thinking is on tuning
vq_max_pending (default 35) for ZFS; you might search for it in the
archives.  [The intent is to have no tunables and to let the system
figure out what to do best.]
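If you do end up throttling at the source, one rough way to pick a
starting point (my own back-of-the-envelope sizing, not an official
formula) is to work backwards from the knee you measured in the lab:

# Rough sizing sketch: work backwards from the array's measured knee.
# ARRAY_KNEE is hypothetical -- substitute the number from your own lab test.
ARRAY_KNEE = 280         # concurrent I/Os the array handles before latency blows up
HOSTS = 4
VDEVS_PER_HOST = 5

per_vdev_queue = ARRAY_KNEE // (HOSTS * VDEVS_PER_HOST)
print("per-vdev queue depth that keeps fan-in under the knee:", per_vdev_queue)  # 14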

"iostat -xtcnz 5" showed the latency dropped from 200 to 20 once we
cut the replication. Since the masters and slaves were using the same
the volume groups and RAID-Z was striping across all of them on both
the masters and slaves, I think this was a big problem.

It is hard for me to visualize your setup, but this is a tell-tale
sign that you've overrun the RAID box.  Changing the volume partitioning
will likely help, perhaps tremendously.
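If it helps to keep an eye on that over time, here is a small Python
filter I might pipe iostat into.  It is just a sketch: it assumes the
stock Solaris "iostat -xn" column order (r/s w/s kr/s kw/s wait actv
wsvc_t asvc_t %w %b device), so adjust it if your output differs.

#!/usr/bin/env python3
# Print any device whose average service time exceeds a threshold.
# Assumes the stock "iostat -xn" column order; the tty/cpu sections and
# header lines produced by -xtcnz are skipped automatically.
import sys

THRESHOLD_MS = 50.0      # arbitrary cut-off for "slow"

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 11:
        continue                     # blank lines, tty/cpu sections, titles
    try:
        asvc_t = float(fields[7])    # average service time, in ms
        actv = float(fields[5])      # average number of active I/Os
    except ValueError:
        continue                     # the column-header line
    if asvc_t > THRESHOLD_MS:
        print("%s: asvc_t=%.1f ms, actv=%.1f" % (fields[10], asvc_t, actv))

Something like "iostat -xn 5 | ./slow_disks.py" would run it (the
script name is arbitrary).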
 -- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
