Re: [zfs-discuss] mpt errors on snv 127
FYI to everyone: the Asus P5W64 motherboard previously in my OpenSolaris machine was the culprit, not the general mpt issues. At the time the motherboard was originally put in that machine, there was not enough ZFS I/O load to trigger the problem, which led to the false impression that the hardware was fine. I'm using a 5400-chipset Xeon board now (Asus DSEB-GH) and my LSI cards are working perfectly again: over 2 hours of heavy I/O and no errors or warnings with snv 127 (with the P5W64/LSI combo on build 127 it would never run more than 15 minutes without warnings). I chose this board partly because it has PCI-X slots, which I thought might be useful for AOC-SAT2-MV8 cards if I couldn't shake the mpt issues, but now that the mpt issues are gone I can continue with that controller if I want.

Thanks everyone for your help,
Chad

On Sun, Dec 06, 2009 at 11:12:50PM -0800, Chad Cantwell wrote:

Thanks for the info on the Yukon driver. I realize too many variables make things impossible to determine, but I had made these hardware changes awhile back, and they seemed to work fine at the time. Since they aren't working now, even on the older OpenSolaris releases (I've tried 2009.06 and 2008.11 now), the problem seems to be a hardware quirk, and the only way to narrow that down is to change hardware back until it works like it used to in at least the older snv builds. I've ruled out the ethernet controller. I'm leaning toward the current motherboard (Asus P5W64) not playing nicely with the LSI cards, but it will probably be several days until I get to the bottom of this, since it takes awhile to test after making a change...

Thanks,
Chad

On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote:

G'day Chad,

The more swaptronics you partake in, the more difficult it is going to be for us (collectively) to figure out what is going wrong on your system. Btw, since you're running a build past 124, you can use the yge driver instead of the yukonx (from Marvell) or myk (from Murayama-san) drivers. As another comment in this thread has mentioned, a full scrub can be a serious test of your hardware, depending on how much data you've got to walk over. If you can keep the hardware variables to a minimum, clarity will be more achievable.

thank you,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
Thanks for the info on the Yukon driver. I realize too many variables make things impossible to determine, but I had made these hardware changes awhile back, and they seemed to work fine at the time. Since they aren't working now, even on the older OpenSolaris releases (I've tried 2009.06 and 2008.11 now), the problem seems to be a hardware quirk, and the only way to narrow that down is to change hardware back until it works like it used to in at least the older snv builds. I've ruled out the ethernet controller. I'm leaning toward the current motherboard (Asus P5W64) not playing nicely with the LSI cards, but it will probably be several days until I get to the bottom of this, since it takes awhile to test after making a change...

Thanks,
Chad

On Mon, Dec 07, 2009 at 11:09:39AM +1000, James C. McPherson wrote:

G'day Chad,

The more swaptronics you partake in, the more difficult it is going to be for us (collectively) to figure out what is going wrong on your system. Btw, since you're running a build past 124, you can use the yge driver instead of the yukonx (from Marvell) or myk (from Murayama-san) drivers. As another comment in this thread has mentioned, a full scrub can be a serious test of your hardware, depending on how much data you've got to walk over. If you can keep the hardware variables to a minimum, clarity will be more achievable.

thank you,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
Hi all,

Unfortunately for me, there does seem to be a hardware component to my problem. Although my rsync copied almost 4TB of data with no iostat errors after going back to OpenSolaris 2009.06, I/O on one of my mpt cards did eventually hang, with 6 disk lights on and 2 off, until rebooting. There are a few hardware changes made since the last time I did a full backup, so it's possible that whatever problem was introduced didn't happen frequently enough under low I/O usage for me to detect until now, when I was reinstalling and copying massive amounts of data back.

The changes I had made since originally installing osol2009.06 several months ago are:

- stopped using the onboard Marvell Yukon2 ethernet (which used a 3rd-party driver) in favor of an Intel 1000 PT dual port, which necessitated an extra PCI-e slot, prompting the following item:
- swapped motherboards between 2 machines (they were similar, though, with similar onboard hardware, and it shouldn't have been a major change). Originally an Asus P5Q Deluxe w/3 PCI-e slots, now a slightly older Asus P5W64 w/4 PCI-e slots.
- the Intel 1000 PT dual-port card has been aggregated as aggr0 since it was installed (the older Yukon2 was a basic interface)

The above changes were made awhile ago, before upgrading OpenSolaris to 127, and things seemed to be working fine for at least 2-3 months with rsync updating (never hung, had a fatal zfs error, or lost access to data requiring a reboot).

New changes since troubleshooting the snv 127 mpt issues:

- upgraded LSI 3081 firmware from 1.28.2 (or was it .02) to 1.29, the latest. If this turns out to be an issue, I do have the previous IT firmware that I was using before, which I can flash back.

Another, albeit unlikely, factor: when I originally copied all my data to my first OpenSolaris raidz2 pool, I didn't use rsync at all; I used netcat and tar, and only set up rsync later for updates. Perhaps the huge initial single rsync of the large tree does something strange that the original initial netcat tar copy did not (I know, unlikely, but I'm grasping at straws here to determine what has happened).

I'll work on ruling out the potential sources of hardware problems before I report any more on the mpt issues, since my test case would probably confound things at this point. I am affected by the mpt bugs, since I would get the timeouts almost constantly in snv 127+, but since I'm also apparently affected by some other unknown hardware issue, my data on the mpt problems might lead people in the wrong direction at this point. I will first try to go back to the non-aggregated Yukon ethernet and remove the Intel dual-port PCI-e network adapter; then, if the problem persists, try half of my drives on each LSI controller individually to confirm whether one controller has a problem the other does not, or whether one drive in one set is causing a new problem for a particular controller. I hope to have some kind of answer at that point and not have to resort to motherboard swapping again.

Chad

On Thu, Dec 03, 2009 at 10:44:53PM -0800, Chad Cantwell wrote:

I eventually performed a few more tests, adjusting some zfs tuning options, which had no effect, and trying the itmpt driver, which someone had said would work; regardless, my system would always freeze quite rapidly in snv 127 and 128a. Just to double-check my hardware, I went back to the OpenSolaris 2009.06 release version, and everything is working fine. The system has been running a few hours, has copied a lot of data, and has not had any trouble, mpt syslog events, or iostat errors.

One thing I found interesting (I don't know if it's significant or not): under both the recent builds and 2009.06, I had run

echo '::interrupts' | mdb -k

to check the interrupts used. (I don't have the printout handy for snv 127+, though.) I have a dual-port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) in the IRQ listing, whereas in OpenSolaris 2009.06 all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could potentially have been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic.

I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed (this machine is mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems).

Chad
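The interrupt check described in the message above can be repeated with a one-liner. This is only a sketch: it must run as root on the Solaris box itself, and the egrep pattern simply narrows the interrupt table down to the NIC and HBA rows mentioned in the thread.

```shell
# Dump the kernel interrupt table and keep only the e1000g / mpt rows,
# to spot NIC and HBA instances that landed on the same IRQ.
echo '::interrupts' | mdb -k | egrep 'e1000g|mpt'
```

If two device instances appear on the same IRQ line in the output, they are sharing that interrupt, which is the condition being discussed here.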
Re: [zfs-discuss] mpt errors on snv 127
I eventually performed a few more tests, adjusting some zfs tuning options, which had no effect, and trying the itmpt driver, which someone had said would work; regardless, my system would always freeze quite rapidly in snv 127 and 128a. Just to double-check my hardware, I went back to the OpenSolaris 2009.06 release version, and everything is working fine. The system has been running a few hours, has copied a lot of data, and has not had any trouble, mpt syslog events, or iostat errors.

One thing I found interesting (I don't know if it's significant or not): under both the recent builds and 2009.06, I had run

echo '::interrupts' | mdb -k

to check the interrupts used. (I don't have the printout handy for snv 127+, though.) I have a dual-port gigabit Intel 1000 P PCI-e card, which shows up as e1000g0 and e1000g1. In snv 127+, each of my e1000g devices shares an IRQ with my mpt devices (mpt0, mpt1) in the IRQ listing, whereas in OpenSolaris 2009.06 all 4 devices are on different IRQs. I don't know if this is significant, but most of my testing when I encountered errors was data transfer via the network, so it could potentially have been interfering with the mpt drivers when it was on the same IRQ. The errors did seem to be less frequent when the server I was copying from was linked at 100 instead of 1000 (one of my tests), but that is as likely to be a result of the slower zpool throughput as it is to be related to the network traffic.

I'll probably stay with 2009.06 for now since it works fine for me, but I can try a newer build again once some more progress is made in this area and people want to see if it's fixed (this machine is mainly to back up another array, so it's not too big a deal to test later when the mpt drivers are looking better and wipe again in the event of problems).

Chad

On Tue, Dec 01, 2009 at 03:06:31PM -0800, Chad Cantwell wrote:

To update everyone: I did a complete zfs scrub, and it generated no errors in iostat, and I have 4.8T of data on the filesystem, so it was a fairly lengthy test. The machine also has exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again, though, I'm sure it would generate errors and crash again.

Chad

On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote:

Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does.

Chad

On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:

I don't think the hardware has any problems; it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix), because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put "set mpt:mpt_enable_msi = 0" in /etc/system now and rebooted, as was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way.

Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:

Chad Cantwell wrote:

After another crash I checked the syslog, and there were some different errors than the ones I saw previously during operation:

...
Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported.
...
Nov 30 20:59:13 the-vault mpt_config_space_init failed
...
Nov 30 20:59:15 the-vault mpt_restart_ioc failed
Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
Nov 30
Re: [zfs-discuss] mpt errors on snv 127
I don't think the hardware has any problems; it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix), because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put "set mpt:mpt_enable_msi = 0" in /etc/system now and rebooted, as was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way.

Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:

Chad Cantwell wrote:

After another crash I checked the syslog, and there were some different errors than the ones I saw previously during operation:

...
Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported.
...
Nov 30 20:59:13 the-vault mpt_config_space_init failed
...
Nov 30 20:59:15 the-vault mpt_restart_ioc failed
Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the PCI bus config space), then you've got about half the nails in the coffin hammered in. Then the failure to restart the IOC (I/O controller unit) == the rest of the lid hammered down.

best regards,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
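For reference, the workaround being tested in this exchange is a single /etc/system tunable. A sketch of applying it (this modifies the system configuration of the affected box, and the setting only takes effect after a reboot):

```shell
# Disable MSI for the mpt driver, as suggested earlier in the thread.
# Shown only as a sketch: appends the tunable to /etc/system.
echo 'set mpt:mpt_enable_msi = 0' >> /etc/system
reboot
```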
Re: [zfs-discuss] mpt errors on snv 127
Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does.

Chad

On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote:

I don't think the hardware has any problems; it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix), because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using bare metal and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier. Anyway, I've put "set mpt:mpt_enable_msi = 0" in /etc/system now and rebooted, as was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly). Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way.

Chad

On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. McPherson wrote:

Chad Cantwell wrote:

After another crash I checked the syslog, and there were some different errors than the ones I saw previously during operation:

...
Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported.
...
Nov 30 20:59:13 the-vault mpt_config_space_init failed
...
Nov 30 20:59:15 the-vault mpt_restart_ioc failed
Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009
Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault
Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16
Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63
Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request.
Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information.
Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled
Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault
Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support.

Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the PCI bus config space), then you've got about half the nails in the coffin hammered in. Then the failure to restart the IOC (I/O controller unit) == the rest of the lid hammered down.

best regards,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] mpt errors on snv 127
This is basically just a "me too". I'm using different hardware but seeing essentially the same problems. The relevant hardware I have is:

- SuperMicro MBD-H8Di3+-F-O motherboard with LSI 1068E onboard
- SuperMicro SC846E2-R900B 4U chassis with two LSI SASx36 expander chips on the backplane
- 24 Western Digital RE4-GP 2TB 7.2k RPM SATA drives

I have two SFF-8087 to SFF-8087 cables running from the two ports on the motherboard (4 channels each) to two ports on the backplane, each port going to one of the LSI expander chips. The backplane has four additional ports which support cascading additional enclosures together, but I'm not making use of any of this at the moment.

The machine is currently dead at the data center, and it's late, so if you want anything more from me, just let me know and I'll run stuff tomorrow on the machine. But otherwise, the behavior sounds the same as all of the other mpt reports recently. I was not seeing these types of problems with 2009.06, but also wanted to upgrade to get raidz3 support. Just tell me what other commands you might want output from to help diagnose the problem.

-- This message posted from opensolaris.org
Re: [zfs-discuss] mpt errors on snv 127
Chad Cantwell wrote:

Hi, I was using OpenSolaris 2009.06 for quite awhile with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3.

Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127, and it started failing? Did you keep the osol2009.06 BE around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)?

MRJ
Re: [zfs-discuss] mpt errors on snv 127
Mark Johnson wrote:

Chad Cantwell wrote:

Hi, I was using OpenSolaris 2009.06 for quite awhile with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3.

Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127, and it started failing? Did you keep the osol2009.06 BE around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)?

What's the earliest build someone has seen this problem? i.e., if we binary chop, has anyone seen it in b118?

I have no idea if the old mpt drivers will work on a new kernel... But if someone wants to try, something like the following should work:

# first, I would work out of a test BE in case you
# mess something up.
beadm create test-be
beadm activate test-be
reboot

# assuming your latest BE is called snv127, mount it and back up
# the stock mpt driver and conf file.
beadm mount snv127 /mnt
cp /mnt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf.orig
cp /mnt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt.orig

# see what builds are out there...
pkg search /kernel/drv/amd64/mpt

# There's probably an easier way to do this...
# grab an older mpt. This will take a while since it's
# not in its own package and ckr has some dependencies,
# so it will pull in a bunch of other packages.
# change out 118 with the build you want to grab.
mkdir /tmp/mpt
pkg image-create -f -F -a opensolaris.org=http://pkg.opensolaris.org/dev /tmp/mpt
pkg -R /tmp/mpt/ install sunw...@0.5.11-0.118
cp /tmp/mpt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf
cp /tmp/mpt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt
rm -rf /tmp/mpt/
bootadm update-archive -R /mnt

MRJ
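After rebooting into the test BE, it may be worth confirming that the swapped-in module is the one actually loaded before drawing conclusions. A sketch using standard Solaris commands (run on the affected box):

```shell
# Show the loaded mpt module (modinfo prints its version string)
# and the device instances bound to the mpt driver.
modinfo | grep -w mpt
prtconf -D | grep mpt
```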
Re: [zfs-discuss] mpt errors on snv 127
We actually tried this, although using the Sol 10 version of the mpt driver. Surprisingly, it didn't work :-)

Yours,
Markus Kovero

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Mark Johnson
Sent: 1 December 2009 15:57
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] mpt errors on snv 127

Mark Johnson wrote:

Chad Cantwell wrote:

Hi, I was using OpenSolaris 2009.06 for quite awhile with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3.

Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127, and it started failing? Did you keep the osol2009.06 BE around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)?

What's the earliest build someone has seen this problem? i.e., if we binary chop, has anyone seen it in b118?

I have no idea if the old mpt drivers will work on a new kernel... But if someone wants to try, something like the following should work:

# first, I would work out of a test BE in case you
# mess something up.
beadm create test-be
beadm activate test-be
reboot

# assuming your latest BE is called snv127, mount it and back up
# the stock mpt driver and conf file.
beadm mount snv127 /mnt
cp /mnt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf.orig
cp /mnt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt.orig

# see what builds are out there...
pkg search /kernel/drv/amd64/mpt

# There's probably an easier way to do this...
# grab an older mpt. This will take a while since it's
# not in its own package and ckr has some dependencies,
# so it will pull in a bunch of other packages.
# change out 118 with the build you want to grab.
mkdir /tmp/mpt
pkg image-create -f -F -a opensolaris.org=http://pkg.opensolaris.org/dev /tmp/mpt
pkg -R /tmp/mpt/ install sunw...@0.5.11-0.118
cp /tmp/mpt/kernel/drv/mpt.conf /mnt/kernel/drv/mpt.conf
cp /tmp/mpt/kernel/drv/amd64/mpt /mnt/kernel/drv/amd64/mpt
rm -rf /tmp/mpt/
bootadm update-archive -R /mnt

MRJ
Re: [zfs-discuss] mpt errors on snv 127
What's the earliest build someone has seen this problem? i.e., if we binary chop, has anyone seen it in b118?

We have used every stable build from b118 up, as b118 was the first reliable one that could be used in a CIFS-heavy environment. The problem occurs on all of them.

- Adam
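The binary-chop idea raised above can be sketched mechanically. Everything here is illustrative: `is_bad` is a stand-in for "boot that build and reproduce the mpt timeouts", the threshold of 125 is invented for the example, and b111 is used as the known-good floor only because osol 2009.06 corresponds roughly to snv_111.

```shell
# Binary search over build numbers for the first bad build.
# is_bad is a stub: in reality it would mean booting the build and
# re-running the I/O load that triggers the mpt timeouts.
is_bad() {
    [ "$1" -ge 125 ]    # hypothetical: builds >= 125 reproduce the bug
}

lo=111   # last known-good build (stand-in for osol 2009.06)
hi=127   # first known-bad build
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if is_bad "$mid"; then hi=$mid; else lo=$mid; fi
done
echo "first bad build: snv_$hi"
```

With real boots replacing the stub, this narrows the 16 candidate builds between b111 and b127 down to 4 test cycles instead of testing every build.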
Re: [zfs-discuss] mpt errors on snv 127
If someone from Sun will confirm that it should work to use the mpt driver from 2009.06, I'd be willing to set up a BE and try it. I still have the snapshot from my 2009.06 install, so I should be able to mount that and grab the files easily enough.
Re: [zfs-discuss] mpt errors on snv 127
Travis Tabbal wrote:

If someone from Sun will confirm that it should work to use the mpt driver from 2009.06, I'd be willing to set up a BE and try it. I still have the snapshot from my 2009.06 install, so I should be able to mount that and grab the files easily enough.

I tried; it doesn't work. It's interesting to note that the itmpt driver (much, much older) works just fine. It seems someone has gotten creative with the mpt driver's use of the DDI.

-- Carson
Re: [zfs-discuss] mpt errors on snv 127
First I tried just upgrading to b127; that had a few issues besides the mpt driver. After that I did a clean install of b127, but no, I don't have my osol2009.06 root still there. I wasn't sure how to install another copy and leave it there (I suspect it is possible, since I saw that doing an upgrade creates a second boot environment, but my forte isn't Solaris, so I just reformatted the root device).

On Tue, Dec 01, 2009 at 08:09:32AM -0500, Mark Johnson wrote:

Chad Cantwell wrote:

Hi, I was using OpenSolaris 2009.06 for quite awhile with the opensolaris-provided mpt driver to operate a zfs raidz2 pool of about ~20T, and this worked perfectly fine (no issues or device errors logged for several months, no hanging). A few days ago I decided to reinstall with the latest OpenSolaris in order to take advantage of raidz3.

Just to be clear... The same setup was working fine on osol2009.06, you upgraded to b127, and it started failing? Did you keep the osol2009.06 BE around so you can reboot back to it? If so, have you tried the osol2009.06 mpt driver in the BE with the latest bits (make sure you make a backup copy of the mpt driver)?

MRJ
Re: [zfs-discuss] mpt errors on snv 127
To update everyone, I did a complete zfs scrub, and it it generated no errors in iostat, and I have 4.8T of data on the filesystem so it was a fairly lengthy test. The machine also has exhibited no evidence of instability. If I were to start copying a lot of data to the filesystem again though, I'm sure it would generate errors and crash again. Chad On Tue, Dec 01, 2009 at 12:29:16AM -0800, Chad Cantwell wrote: Well, ok, the msi=0 thing didn't help after all. A few minutes after my last message a few errors showed up in iostat, and then in a few minutes more the machine was locked up hard... Maybe I will try just doing a scrub instead of my rsync process and see how that does. Chad On Tue, Dec 01, 2009 at 12:13:36AM -0800, Chad Cantwell wrote: I don't think the hardware has any problems, it only started having errors when I upgraded OpenSolaris. It's still working fine again now after a reboot. Actually, I reread one of your earlier messages, and I didn't realize at first when you said non-Sun JBOD that this didn't apply to me (in regards to the msi=0 fix) because I didn't realize JBOD was shorthand for an external expander device. Since I'm just using baremetal, and passive backplanes, I think the msi=0 fix should apply to me based on what you wrote earlier, anyway I've put set mpt:mpt_enable_msi = 0 now in /etc/system and rebooted as it was suggested earlier. I've resumed my rsync, and so far there have been no errors, but it's only been 20 minutes or so. I should have a good idea by tomorrow if this definitely fixed the problem (since even when the machine was not crashing it was tallying up iostat errors fairly rapidly) Thanks again for your help. Sorry for wasting your time if the previously posted workaround fixes things. I'll let you know tomorrow either way. Chad On Tue, Dec 01, 2009 at 05:57:28PM +1000, James C. 
McPherson wrote: Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ... Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request. Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support. Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the pci bus config space), then you've got about 1/2 the nails in the coffin hammered in. Then the failure to restart the IOC (io controller unit) == the rest of the lid hammered down. best regards, James C. 
McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
Chad Cantwell wrote: Hi, Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet. I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with 8 drives each. Are these disks internal to your server's chassis, or external in a jbod? If in a jbod, which one? Also, which cables are you using? thankyou, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
Hi, Replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card). Chad On Tue, Dec 01, 2009 at 01:02:34PM +1000, James C. McPherson wrote: Chad Cantwell wrote: Hi, Sorry for not replying to one of the already open threads on this topic; I've just joined the list for the purposes of this discussion and have nothing in my client to reply to yet. I have an x86_64 opensolaris machine running on a Core 2 Quad Q9650 platform with two LSI SAS3081E-R PCI-E 8 port SAS controllers, with 8 drives each. Are these disks internal to your server's chassis, or external in a jbod? If in a jbod, which one? Also, which cables are you using? thankyou, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
Chad Cantwell wrote: Hi, Replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card). Hi Chad, thanks for the followup. Just to confirm - you've got this Chenbro chassis connected to the actual server chassis (where the cpu is), or do you have the cpu inside the Chenbro chassis? thankyou, James -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
Hi, The Chenbro chassis contains everything - the motherboard/CPU and the disks. As far as I know, the Chenbro backplanes are basically electrical jumpers that the LSI cards shouldn't be aware of; they pass the SATA signals through directly from the SFF-8087 cables to the disks. Thanks, Chad On Tue, Dec 01, 2009 at 01:43:06PM +1000, James C. McPherson wrote: Chad Cantwell wrote: Hi, Replied to your previous general query already, but in summary, they are in the server chassis. It's a Chenbro 16 hotswap bay case. It has 4 mini backplanes that each connect via an SFF-8087 cable (1m) to my LSI cards (2 cables / 8 drives per card). Hi Chad, thanks for the followup. Just to confirm - you've got this Chenbro chassis connected to the actual server chassis (where the cpu is), or do you have the cpu inside the Chenbro chassis? thankyou, James -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] mpt errors on snv 127
After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation:

Nov 30 20:26:11 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:26:11 the-vault 	Disconnected command timeout for Target 10
Nov 30 20:59:12 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:12 the-vault 	mpt_send_handshake_msg task 3 failed
Nov 30 20:59:13 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:13 the-vault 	LSI PCI device (1000,) not supported.
Nov 30 20:59:13 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:13 the-vault 	mpt_config_space_init failed
Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:15 the-vault 	LSI PCI device (1000,) not supported.
Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:15 the-vault 	mpt_config_space_init failed
Nov 30 20:59:15 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 20:59:15 the-vault 	mpt_restart_ioc failed
Nov 30 21:32:17 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:17 the-vault 	mpt_send_handshake_msg task 4 failed
Nov 30 21:32:18 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:18 the-vault 	LSI PCI device (1000,) not supported.
Nov 30 21:32:18 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:18 the-vault 	mpt_config_space_init failed
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	LSI PCI device (1000,) not supported.
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	mpt_config_space_init failed
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	mpt_restart_ioc failed
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	Rejecting future commands
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@0/pci1000,3...@0 (mpt0):
Nov 30 21:32:19 the-vault 	Disconnected command timeout for Target 14
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Nov 30 21:32:19 the-vault scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@3/pci111d,8...@0/pci111d,8...@1/pci1000,3...@0 (mpt1):
Nov 30 21:32:19 the-vault 	rejecting command, throttle choked
Re: [zfs-discuss] mpt errors on snv 127
Chad Cantwell wrote: After another crash I checked the syslog and there were some different errors than the ones I saw previously during operation: ... Nov 30 20:59:13 the-vault LSI PCI device (1000,) not supported. ... Nov 30 20:59:13 the-vault mpt_config_space_init failed ... Nov 30 20:59:15 the-vault mpt_restart_ioc failed Nov 30 21:33:02 the-vault fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-8R, TYPE: Fault, VER: 1, SEVERITY: Major Nov 30 21:33:02 the-vault EVENT-TIME: Mon Nov 30 21:33:02 PST 2009 Nov 30 21:33:02 the-vault PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: the-vault Nov 30 21:33:02 the-vault SOURCE: eft, REV: 1.16 Nov 30 21:33:02 the-vault EVENT-ID: 7886cc0d-4760-60b2-e06a-8158c3334f63 Nov 30 21:33:02 the-vault DESC: The transmitting device sent an invalid request. Nov 30 21:33:02 the-vault Refer to http://sun.com/msg/PCIEX-8000-8R for more information. Nov 30 21:33:02 the-vault AUTO-RESPONSE: One or more device instances may be disabled Nov 30 21:33:02 the-vault IMPACT: Loss of services provided by the device instances associated with this fault Nov 30 21:33:02 the-vault REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s). Use fmadm faulty to identify the devices or contact Sun for support. Sorry to have to tell you, but that HBA is dead. Or at least dying horribly. If you can't init the config space (that's the pci bus config space), then you've got about 1/2 the nails in the coffin hammered in. Then the failure to restart the IOC (io controller unit) == the rest of the lid hammered down. best regards, James C. McPherson -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
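As the fmd REC-ACTION in the quoted log suggests, fmadm/fmdump can confirm which device instance was diagnosed as faulty. A minimal sketch (run as root; the repair step assumes the HBA has actually been replaced):

```shell
# Show resources the fault manager currently considers faulty.
fmadm faulty

# Dump the error telemetry behind the PCIEX-8000-8R diagnosis in
# verbose form to see the offending device path.
fmdump -eV

# After replacing the HBA, mark the old fault repaired; the UUID
# here is the EVENT-ID from the fmd message in the log above.
fmadm repair 7886cc0d-4760-60b2-e06a-8158c3334f63
```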