On May 23, 2010, at 6:01 AM, Demian Phillips wrote:

> On Sat, May 22, 2010 at 11:33 AM, Bob Friesenhahn
> <bfrie...@simple.dallas.tx.us> wrote:
>> On Fri, 21 May 2010, Demian Phillips wrote:
>>
>>> For years I have been running a zpool using a Fibre Channel array with
>>> no problems. I would scrub every so often and dump huge amounts of
>>> data (tens or hundreds of GB) around and it never had a problem
>>> outside of one confirmed (by the array) disk failure.
>>>
>>> I upgraded to sol10x86 05/09 last year and since then I have
>>> discovered any sufficiently high I/O from ZFS starts causing timeouts
>>> and off-lining disks. This leads to failure (once rebooted and cleaned
>>> all is well) long term because you can no longer scrub reliably.
>>
>> The problem could be with the device driver, your FC card, or the array
>> itself. In my case, issues I thought were to blame on my motherboard or
>> Solaris were due to a defective FC card and replacing the card resolved the
>> problem.
>>
>> If the problem is that your storage array is becoming overloaded with
>> requests, then try adding this to your /etc/system file:
>>
>> * Set device I/O maximum concurrency
>> * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29
>> set zfs:zfs_vdev_max_pending = 5

I would lower it even further, perhaps to 2.

> I've gone back to Solaris 10 11/06.
> It's working fine, but I notice some differences in performance that
> are I think key to the problem.

Yep, lots of performance improvements were added later.

> With the latest Solaris 10 (u8) throughput according to zpool iostat
> was hitting about 115MB/sec sometimes a little higher.

That should be about right for the A5100.

> With 11/06 it maxes out at 40MB/sec.
>
> Both setups are using mpio devices as far as I can tell.
>
> Next is to go back to u8 and see if the tuning you suggested will
> help. It really looks to me that the OS is asking too much of the FC
> chain I have.

I think that is a nice way of saying it.

> The really puzzling thing is I just got told about a brand new Dell
> Solaris x86 production box using current and supported FC devices and
> a supported SAN get the same kind of problems when a scrub is run. I'm
> going to investigate that and see if we can get a fix from Oracle as
> that does have a support contract. It may shed some light on the issue
> I am seeing on the older hardware.

The scrub workload is no different from any other stress test. I'm sure you
can run a benchmark or three on the raw device and get the same error
messages.

FWIW, the A5100 went end-of-life (EOL) in 2001 and end-of-service-life
(EOSL) in 2006. Personally, I hate them with a passion and would like to
extend an offer to use my tractor to bury the beast :-).

 -- richard

--
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
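
For a rough feel of what lowering zfs_vdev_max_pending does, here is a minimal sketch. It assumes the tunable caps queued I/Os per device, so the ceiling across the array is roughly that value times the number of devices. The function name and the device count of 14 are hypothetical and not taken from this thread; only the values 5 and 2 come from the discussion above.

```python
# Rough upper bound on the I/Os ZFS may keep outstanding against an array.
# Assumption: zfs_vdev_max_pending limits queued I/Os per device, so the
# array-wide ceiling is approximately that value times the device count.


def max_outstanding_ios(vdev_max_pending: int, num_devices: int) -> int:
    """Return the approximate ceiling on in-flight I/Os across all devices."""
    if vdev_max_pending < 1 or num_devices < 1:
        raise ValueError("both arguments must be positive")
    return vdev_max_pending * num_devices


# Hypothetical device count, chosen only for illustration.
NUM_DEVICES = 14

# 5 is the value suggested earlier in the thread, 2 is the lower value
# proposed in this reply.
for pending in (5, 2):
    ceiling = max_outstanding_ios(pending, NUM_DEVICES)
    print(f"zfs_vdev_max_pending={pending}: up to {ceiling} I/Os in flight")
```

This is only arithmetic, not a measurement. Whether a given array copes with the resulting load still has to be checked with a scrub or a benchmark on the real hardware.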