This is not directly ZFS related, but I have seen similar discussions on this list, and maybe someone has "the final" answer to this problem. Most of the tips and "these things could help" suggestions I have found so far have not fully solved it.
We are struggling with the behaviour of an LSI 3081E-R combined with SATA disks behind an expander. One disk behind the expander is known to be bad. Reading from that disk with dd causes I/O to other (good) disks to fail sooner (Solaris) or later (Linux), but it always fails eventually and makes the system unusable.

Under Linux, after some time (maybe when certain things come together), a few LogInfo(0x31123000) messages briefly interrupt I/O to other disks. Then more and more of these messages show up, making any kind of I/O to disks behind the expander impossible. Under Solaris it doesn't even take that long: after reading once or twice from the bad disk, I/O to other disks fails almost immediately (and it looks like the HBA/bus(?) sometimes re-initializes completely).

The error code 0x31123000 is SAS(3) + PL(1) + PL_LOGINFO_CODE_ABORT(12) + PL_LOGINFO_SUB_CODE_BREAK_ON_STUCK_LINK(3000). I guess this relates to the HBA-to-expander link(s) not being up or re-established at that time, and the HBA is maybe just not waiting long enough (but how long is long enough and not too long?).

I'm trying to understand a bit better:

- why this is triggered, and by whom (e.g. because mpt sends a bus/target reset)
- what exactly is going on (e.g. a reset is sent, the HBA-to-expander link takes too long or sees problems, and then gets caught in a reset loop)
- whether this is "as-per-design" (e.g. SATA disks behind expanders are always toxic)
- whether the problem can be pinned on one component (such as the HBA's FW or the expander FW) or on a combination of things (such as WD drives acting strangely and causing problems with the expander)
- any ideas for pinpointing the problem and getting a clearer picture of the issue.
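For anyone who wants to check other LogInfo values from their own logs, the breakdown of 0x31123000 given above can be reproduced with a few shifts and masks. This is only a minimal sketch: the field widths (4 bits bus type, 4 bits originator, 8 bits code, 16 bits sub-code) are an assumption inferred from that breakdown, and decode_loginfo is a made-up helper name, not part of any driver. The symbolic names for each value should be looked up in the LSI header that ships with the driver sources.

```python
def decode_loginfo(value):
    """Split a 32-bit mpt LogInfo value into four fields.

    Assumed layout (inferred from the 0x31123000 breakdown, not from a spec):
      bits 28-31  bus type    (3 = SAS)
      bits 24-27  originator  (1 = PL)
      bits 16-23  code        (0x12 = PL_LOGINFO_CODE_ABORT)
      bits  0-15  sub-code    (0x3000 = PL_LOGINFO_SUB_CODE_BREAK_ON_STUCK_LINK)
    """
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


# Print each field of the value seen in the logs, in hex.
for name, field in decode_loginfo(0x31123000).items():
    print(f"{name}: 0x{field:x}")
```

For 0x31123000 this prints bus_type 0x3, originator 0x1, code 0x12 and sub_code 0x3000, which matches the breakdown above.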
I did some other quick, preliminary tests, which make me think it's most likely a "fatal" LSI3081-expander problem, but I could be wrong:

- Moving the bad disk away from the expander to another port on the same HBA: when reading from the bad disk (not behind the expander), I/O to other disks (behind the expander) does not seem to be affected at all.
- Replacing the 3081 with a 9211, keeping the bad disk behind the expander: when reading from the bad disk, I/O to other disks seems to stop briefly but resumes quickly, and no errors for the "good" disks have been seen so far (at least under Solaris 11 booted from a LiveCD). Still not perfect, but better.

I do have an Oracle case open on this, but although I learned a few things, it has produced no real result (it's not Oracle HW). WD was kind enough to quickly provide the latest FW for the drives, but nothing more so far. LSI is ... well, they take their time, and their first reply was "no, we are not aware of any issues like this" (strange, as there are quite a few postings about this out there).

Many thanks for sharing your experience or ideas on this.

Markus

PS: LSI 3081E-R (SAS1068E B3), running 01.33.00.00 / 6.36.00.00; expander backplane is SC216-E1 (SASX36 A1 7015) and WD3000BLFS FW 04.04V06. Solaris 10 & 11, OpenSolaris 134, OpenIndiana 151a, CentOS 6.2 with 3.04.19 MPT, OpenSuse with 11.1 & 4.00.43.00suse MPT and/or latest LSI drivers 4.28.00.00 ...