I eventually ran a few more tests: I adjusted some ZFS tuning options, which
had no effect, and tried the itmpt driver, which someone had said would work.
Regardless, my system would always freeze quite rapidly in snv 127 and 128a.
Just to double-check my hardware, I went back to the OpenSolaris 2009.06
release, and everything is working fine.  The system has been running for a
few hours and has copied a lot of data without any trouble, mpt syslog
events, or iostat errors.

One thing I found interesting, and I don't know whether it's significant,
came from checking the interrupt assignments.  Under both the recent builds
and 2009.06, I ran "echo '::interrupts' | mdb -k" to see which interrupts
were in use.  (I don't have the printout handy for snv 127+, though.)

I have a dual-port gigabit Intel 1000 P PCI-e card, which shows up as
e1000g0 and e1000g1.  In snv 127+, each of my e1000g devices shares an IRQ
with my mpt devices (mpt0, mpt1) in the IRQ listing, whereas in OpenSolaris
2009.06 all four devices are on different IRQs.  I don't know if this is
significant, but most of my testing when I encountered errors was data
transfer over the network, so the network card could potentially have been
interfering with the mpt driver while it was on the same IRQ.  The errors
did seem to be less frequent when the server I was copying from was linked
at 100 Mbit instead of 1000 Mbit (one of my tests), but that is as likely to
be a result of the slower zpool throughput as it is to be related to the
network traffic.
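
For anyone who wants to check their own machine for the same pattern, here
is a rough Python sketch of the comparison.  It assumes you have copied the
device/IRQ pairs by hand out of the ::interrupts listing; it does not parse
mdb output.  The IRQ numbers below, and which e1000g is paired with which
mpt, are made up for illustration and are not from my printouts, and
shared_irqs is just a name I picked.

```python
from collections import defaultdict


def shared_irqs(assignments):
    """Return {irq: [devices]} for every IRQ used by more than one device.

    assignments is a list of (device_name, irq_number) pairs.
    """
    by_irq = defaultdict(list)
    for device, irq in assignments:
        by_irq[irq].append(device)
    # Keep only the IRQs that have two or more devices on them.
    return {irq: sorted(devs) for irq, devs in by_irq.items() if len(devs) > 1}


# HYPOTHETICAL values: the IRQ numbers and pairings are illustrative only.
snv_127 = [("e1000g0", 16), ("mpt0", 16), ("e1000g1", 17), ("mpt1", 17)]
osol_2009_06 = [("e1000g0", 16), ("e1000g1", 17), ("mpt0", 18), ("mpt1", 19)]

print(shared_irqs(snv_127))       # IRQs 16 and 17 are each shared
print(shared_irqs(osol_2009_06))  # empty dict: nothing is shared
```

An empty result means every device has its own IRQ, which is what I see on
2009.06.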

I'll probably stay with 2009.06 for now since it works fine for me, but I
can try a newer build again once more progress has been made in this area
and people want to see if it's fixed.  (This machine is mainly used to back
up another array, so it's not too big a deal to test later when the mpt
drivers are looking better, and to wipe it again in the event of problems.)

Chad

On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:
> To update everyone, I did a complete zfs scrub, and it generated no errors 
> in iostat, and I have 4.8T of
> data on the filesystem so it was a fairly lengthy test.  The machine also has 
> exhibited no evidence of
> instability.  If I were to start copying a lot of data to the filesystem 
> again though, I'm sure it would
> generate errors and crash again.
> 
> Chad
> 
> 
> On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:
> > Well, ok, the msi=0 thing didn't help after all.  A few minutes after my 
> > last message a few errors showed
> > up in iostat, and then in a few minutes more the machine was locked up 
> > hard...  Maybe I will try just
> > doing a scrub instead of my rsync process and see how that does.
> > 
> > Chad
> > 
> > 
> > On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:
> > > I don't think the hardware has any problems, it only started having 
> > > errors when I upgraded OpenSolaris.
> > > It's still working fine again now after a reboot.  Actually, I reread one 
> > > of your earlier messages,
> > > and I didn't realize at first when you said "non-Sun JBOD" that this 
> > > didn't apply to me (in regards to
> > > the msi=0 fix) because I didn't realize JBOD was shorthand for an 
> > > external expander device.  Since
> > > I'm just using baremetal, and passive backplanes, I think the msi=0 fix 
> > > should apply to me based on
> > > what you wrote earlier, anyway I've put 
> > >   set mpt:mpt_enable_msi = 0
> > > now in /etc/system and rebooted as it was suggested earlier.  I've 
> > > resumed my rsync, and so far there
> > > have been no errors, but it's only been 20 minutes or so.  I should have 
> > > a good idea by tomorrow if this
> > > definitely fixed the problem (since even when the machine was not 
> > > crashing it was tallying up iostat errors
> > > fairly rapidly)
> > > 
> > > Thanks again for your help.  Sorry for wasting your time if the 
> > > previously posted workaround fixes things.
> > > I'll let you know tomorrow either way.
> > > 
> > > Chad
> > > 
> > > On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:
> > > > Chad Cantwell wrote:
> > > > >After another crash I checked the syslog and there were some different 
> > > > >errors than the ones
> > > > >I saw previously during operation:
> > > > ...
> > > > 
> > > > >Nov 30 20:59:13 the-vault       LSI PCI device (1000,ffff) not 
> > > > >supported.
> > > > ...
> > > > >Nov 30 20:59:13 the-vault       mpt_config_space_init failed
> > > > ...
> > > > >Nov 30 20:59:15 the-vault       mpt_restart_ioc failed
> > > > ....
> > > > 
> > > > >Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: 
> > > > >PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
> > > > >Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
> > > > >Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: 
> > > > >System-Serial-Number, HOSTNAME: the-vault
> > > > >Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
> > > > >Nov 30 21:33:02 the-vault EVENT-ID: 
> > > > >7886cc0d-4760-60b2-e06a-8158c3334f63
> > > > >Nov 30 21:33:02 the-vault DESC: The transmitting device sent an 
> > > > >invalid request.
> > > > >Nov 30 21:33:02 the-vault   Refer to http://sun.com/msg/PCIEX-8000-8R 
> > > > >for more information.
> > > > >Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances 
> > > > >may be disabled
> > > > >Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the 
> > > > >device instances associated with this fault
> > > > >Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers 
> > > > >and patches are installed. Otherwise schedule a repair procedure to 
> > > > > >replace the affected device(s).  Use fmadm faulty to identify the 
> > > > > >devices or contact Sun for support.
> > > > 
> > > > 
> > > > Sorry to have to tell you, but that HBA is dead. Or at
> > > > least dying horribly. If you can't init the config space
> > > > (that's the pci bus config space), then you've got about
> > > > 1/2 the nails in the coffin hammered in. Then the failure
> > > > to restart the IOC (io controller unit) == the rest of
> > > > the lid hammered down.
> > > > 
> > > > 
> > > > best regards,
> > > > James C. McPherson
> > > > --
> > > > Senior Kernel Software Engineer, Solaris
> > > > Sun Microsystems
> > > > http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
> > > _______________________________________________
> > > zfs-discuss mailing list
> > > zfs-discuss@opensolaris.org
> > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
