On May 23, 2010, at 6:01 AM, Demian Phillips wrote:
> On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn
> <bfrie...@simple.dallas.tx.us> wrote:
>> On Fri, 21 May 2010, Demian Phillips wrote:
>> 
>>> For years I have been running a zpool using a Fibre Channel array with
>>> no problems. I would scrub every so often and dump huge amounts of
>>> data (tens or hundreds of GB) around and it never had a problem
>>> outside of one confirmed (by the array) disk failure.
>>> 
>>> I upgraded to sol10x86 05/09 last year and since then I have
>>> discovered any sufficiently high I/O from ZFS starts causing timeouts
>>> and off-lining disks. This leads to failure (once rebooted and cleaned
>>> all is well) long term because you can no longer scrub reliably.
>> 
>> The problem could be with the device driver, your FC card, or the array
>> itself.  In my case, issues I thought were to blame on my motherboard or
>> Solaris were due to a defective FC card and replacing the card resolved the
>> problem.
>> 
>> If the problem is that your storage array is becoming overloaded with
>> requests, then try adding this to your /etc/system file:
>> 
>> * Set device I/O maximum concurrency
>> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>> set zfs:zfs_vdev_max_pending = 5

I would lower it even further.  Perhaps 2.
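
As a rough sketch only, assuming the same tunable Bob quoted above and
taking 2 as a starting point to experiment with rather than a tested
value for this array, the /etc/system entry would look something like:

```
* Limit the number of I/Os ZFS queues to each vdev (leaf device).
* The value 2 is an illustrative guess for an old, easily
* overloaded FC array; adjust it based on observed behavior.
set zfs:zfs_vdev_max_pending = 2
```

Changes to /etc/system only take effect after a reboot. The Evil
Tuning Guide linked above also describes changing the value on a live
system with mdb -kw, which is handy for trying a few values before
making one permanent.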

> I've gone back to Solaris 10 11/06.
> It's working fine, but I notice some differences in performance that
> are, I think, key to the problem.

Yep, lots of performance improvements were added later.

> With the latest Solaris 10 (u8) throughput according to zpool iostat
> was hitting about 115MB/sec sometimes a little higher.

That should be about right for the A5100.

> With 11/06 it maxes out at 40MB/sec.
> 
> Both setups are using mpio devices as far as I can tell.
> 
> Next is to go back to u8 and see if the tuning you suggested will
> help. It really looks to me that the OS is asking too much of the FC
> chain I have.

I think that is a nice way of saying it.

> The really puzzling thing is I just got told about a brand new Dell
> Solaris x86 production box using current and supported FC devices and
> a supported SAN get the same kind of problems when a scrub is run. I'm
> going to investigate that and see if we can get a fix from Oracle as
> that does have a support contract. It may shed some light on the issue
> I am seeing on the older hardware.

The scrub workload is no different than any other stress test. I'm sure you
can run a benchmark or three on the raw device and get the same error
messages.

FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
(EOSL) in 2006. Personally, I hate them with a passion and would like to
extend an offer to use my tractor to bury the beast :-).
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
