I eventually performed a few more tests: adjusting some zfs tuning options, which had no effect, and trying the itmpt driver, which someone had said would work. Regardless, my system would always freeze quite rapidly in snv 127 and 128a. Just to double-check my hardware, I went back to the OpenSolaris 2009.06 release, and everything is working fine. The system has been running for a few hours and has copied a lot of data without any trouble: no mpt syslog events and no iostat errors.
One thing I found interesting, and I don't know if it's significant or not: under both the recent builds and 2009.06, I ran "echo '::interrupts' | mdb -k" to check the interrupts in use (I don't have the printout handy for snv 127+, though). I have a dual-port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) in the IRQ listing, whereas in OpenSolaris 2009.06 all 4 devices are on different IRQs. Most of my testing when I encountered errors was data transfer over the network, so the network traffic could potentially have been interfering with the mpt driver while the two shared an IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 Mbit instead of 1000 Mbit (one of my tests), but that is as likely to be a result of the lower zpool throughput as of the reduced network traffic.

I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once more progress is made in this area and people want to see if it's fixed. (This machine mainly backs up another array, so it's not too big a deal to test later, when the mpt drivers are looking better, and to wipe again in the event of problems.)

Chad

On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:
> To update everyone, I did a complete zfs scrub, and it generated no
> errors in iostat, and I have 4.8T of data on the filesystem so it was
> a fairly lengthy test. The machine also has exhibited no evidence of
> instability. If I were to start copying a lot of data to the
> filesystem again though, I'm sure it would generate errors and crash
> again.
>
> Chad
>
>
> On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:
> > Well, ok, the msi=0 thing didn't help after all.
> > A few minutes after my last message a few errors showed up in
> > iostat, and then in a few minutes more the machine was locked up
> > hard... Maybe I will try just doing a scrub instead of my rsync
> > process and see how that does.
> >
> > Chad
> >
> >
> > On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:
> > > I don't think the hardware has any problems, it only started
> > > having errors when I upgraded OpenSolaris. It's still working
> > > fine again now after a reboot. Actually, I reread one of your
> > > earlier messages, and I didn't realize at first when you said
> > > "non-Sun JBOD" that this didn't apply to me (in regards to the
> > > msi=0 fix) because I didn't realize JBOD was shorthand for an
> > > external expander device. Since I'm just using baremetal, and
> > > passive backplanes, I think the msi=0 fix should apply to me
> > > based on what you wrote earlier, anyway I've put
> > > set mpt:mpt_enable_msi = 0
> > > now in /etc/system and rebooted as it was suggested earlier.
> > > I've resumed my rsync, and so far there have been no errors, but
> > > it's only been 20 minutes or so. I should have a good idea by
> > > tomorrow if this definitely fixed the problem (since even when
> > > the machine was not crashing it was tallying up iostat errors
> > > fairly rapidly)
> > >
> > > Thanks again for your help. Sorry for wasting your time if the
> > > previously posted workaround fixes things. I'll let you know
> > > tomorrow either way.
> > >
> > > Chad
> > >
> > > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:
> > > > Chad Cantwell wrote:
> > > > >After another crash I checked the syslog and there were some
> > > > >different errors than the ones I saw previously during
> > > > >operation:
> > > > ...
> > > >
> > > > >Nov 30 20:59:13 the-vault LSI PCI device (1000,ffff) not supported.
> > > > ...
> > > > >Nov 30 20:59:13 the-vault mpt_config_space_init failed
> > > > ...
> > > > >Nov 30 20:59:15 the-vault mpt_restart_ioc failed
> > > > ....
> > > >
> > > > >Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> > > > >Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> > > > >Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
> > > > >Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> > > > >Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
> > > > >Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
> > > > >Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
> > > > >Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
> > > > >Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
> > > > >Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.
> > > >
> > > >
> > > > Sorry to have to tell you, but that HBA is dead. Or at
> > > > least dying horribly. If you can't init the config space
> > > > (that's the pci bus config space), then you've got about
> > > > 1/2 the nails in the coffin hammered in. Then the failure
> > > > to restart the IOC (io controller unit) == the rest of
> > > > the lid hammered down.
> > > >
> > > >
> > > > best regards,
> > > > James C. McPherson
> > > > --
> > > > Senior Kernel Software Engineer, Solaris
> > > > Sun Microsystems
> > > > http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss