Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Just FYI, this is from Intel: http://www.intel.com/support/motherboards/server/sb/CS-031831.htm

Another observation: Oracle/Sun has moved away from SATA to SAS in its ZFS storage appliances.

If you want to go deeper, take a look at these presentations (and the others on the same site): http://www.scsita.org/sas_library/tutorials/

regards

On 5/31/2012 12:45 PM, "Antonio S. Cofiño" wrote:
> Markus,
>
> After Jim's answer I have started to read about this well-known issue.
>
>> Is it just mpt causing the errors or also mpt_sas?
>
> Both drivers are causing the reset storm (see my answer to Jim's e-mail).
>
>> General consensus from various people: don't use SATA drives on SAS
>> backplanes. Some SATA drives might work better, but there seems to be no
>> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> In Paul Kraus's answer he mentions that Oracle support says (among other
> things):
>
>>> 4. the problem happens with SAS as well as SATA drives, but is much
>>> less frequent
>
> That means that using SAS drives will reduce the probability of the issue,
> but no guarantee exists.
>
>> General consensus from various people: don't use SATA drives on SAS
>> backplanes. Some SATA drives might work better, but there seems to be no
>> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> Maybe the 'general consensus' is right, but the 'general consensus' also
> told me to use hardware-based RAID solutions. Instead I started doing
> 'risky business' (as some vendors called it) using ZFS, and I have ended up
> discovering how robust ZFS is against this kind of protocol error.
>
> From my completely naive point of view it looks more like an issue with the
> HBA's firmware than an issue with the SATA drives.
>
> Your answers have prompted a lot of research and helped me learn new
> things. More comments and help are welcome (from some SAS expert?).
>
> Antonio
>
> --
> Antonio S. Cofiño
>
> On 31/05/2012 18:04, Weber, Markus wrote:
>> Antonio S. Cofiño wrote:
>>> [...]
>>> The system is a Supermicro motherboard X8DTH-6F in a 4U chassis
>>> (SC847E1-R1400LPB) plus an external SAS2 JBOD (SC847E16-RJBOD1). That
>>> makes a system with a total of 4 backplanes (2x SAS + 2x SAS2), each of
>>> them connected to a different HBA (2x LSI 3081E-R (1068 chip) + 2x LSI
>>> SAS9200-8e (2008 chip)). The system has a total of 81 disks (2x SAS
>>> (Seagate ST3146356SS) + 34x SATA3 (Hitachi HDS722020ALA330) + 45x SATA6
>>> (Hitachi HDS723020BLA642)).
>>>
>>> The issue arises when one of the disks starts to fail, taking a very long
>>> time to complete accesses. After some time (minutes, but I'm not sure)
>>> all the disks connected to the same HBA start to report errors. This
>>> situation produces a general failure in ZFS, making the whole pool
>>> unavailable.
>>> [...]
>>
>> Have been there and gave up in the end[1]. Could reproduce it (even though
>> it took a bit longer) under most Linux versions (incl. using the latest
>> LSI drivers) and the LSI 3081E-R HBA.
>>
>> Is it just mpt causing the errors or also mpt_sas?
>>
>> In a lab environment the LSI 9200 HBA behaved better - I/O only dropped
>> briefly and then continued on the other disks without generating errors.
>>
>> Had a lengthy Oracle case on this, but none of the proposed "workarounds"
>> worked for me at all. They were (some also from other forums):
>>
>> - disabling NCQ
>> - allow-bus-device-reset=0; in /kernel/drv/sd.conf
>> - set zfs:zfs_vdev_max_pending=1
>> - set mpt:mpt_enable_msi=0
>> - keeping usage below 90%
>> - no FM services running; also temporarily did fmadm unload disk-transport
>>   and stopped other disk-access stuff (smartd?)
>> - tried changing retries/timeouts via sd.conf for the disks, without any
>>   success, and ended up doing it via mdb
>>
>> In the end I knew the bad sector of the "bad" disk, and by simply dd'ing
>> that sector once or twice to /dev/zero I could easily bring down the
>> system/pool without any load on the disk system.
>>
>> General consensus from various people: don't use SATA drives on SAS
>> backplanes. Some SATA drives might work better, but there seems to be no
>> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>>
>> Markus
>>
>> [1] Search for "What's wrong with LSI 3081 (1068) + expander + (bad) SATA
>> disk?"
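For reference, the workarounds in Markus's list map onto a couple of standard Solaris/illumos configuration files and commands. The sketch below shows roughly where each one would go; the device name and sector number in the dd line are placeholders, and per Markus's report none of this actually fixed the problem:

  * /etc/system (comment lines start with '*'; needs a reboot)
  set zfs:zfs_vdev_max_pending=1
  set mpt:mpt_enable_msi=0

  # /kernel/drv/sd.conf (needs a reboot or a reload of the sd driver)
  allow-bus-device-reset=0;

  # temporarily stop FMA from polling the disks
  fmadm unload disk-transport

  # reproduce the hang by reading the known-bad sector of the bad disk
  # (c5t12d0s0 and 123456789 are placeholders for the real device and LBA)
  dd if=/dev/rdsk/c5t12d0s0 of=/dev/zero bs=512 skip=123456789 count=1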
Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
On May 31, 2012, at 9:45 AM, Antonio S. Cofiño wrote:
> Markus,
>
> After Jim's answer I have started to read about this well-known issue.
>
>> Is it just mpt causing the errors or also mpt_sas?
>
> Both drivers are causing the reset storm (see my answer to Jim's e-mail).

No. Resets are corrective actions that occur because of command timeouts. The cause is the command timeout.

>> General consensus from various people: don't use SATA drives on SAS
>> backplanes. Some SATA drives might work better, but there seems to be no
>> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> In Paul Kraus's answer he mentions that Oracle support says (among other
> things):
>>> 4. the problem happens with SAS as well as SATA drives, but is much
>>> less frequent
>
> That means that using SAS drives will reduce the probability of the issue,
> but no guarantee exists.

I have seen broken SAS drives crush expanders/HBAs such that POST would not run. Obviously, at that point there is no OS running, so we can't blame the OS or drivers. I have a few SATA disks that have the same effect on motherboards. I call them the "drives of doom" :-)

>> General consensus from various people: don't use SATA drives on SAS
>> backplanes. Some SATA drives might work better, but there seems to be no
>> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> Maybe the 'general consensus' is right, but the 'general consensus' also
> told me to use hardware-based RAID solutions. Instead I started doing
> 'risky business' (as some vendors called it) using ZFS, and I have ended up
> discovering how robust ZFS is against this kind of protocol error.

"Hardware" RAID solutions are also susceptible to these failure modes.

> From my completely naive point of view it looks more like an issue with the
> HBA's firmware than an issue with the SATA drives.

There are multiple contributors, but perhaps the most difficult to overcome are the fundamental differences between the SAS and SATA protocols. Let's just agree that SATA was not designed for network-like fabrics.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
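The command timeout Richard refers to lives in the sd driver (the sd_io_time tunable, 60 seconds by default on Solaris/illumos). A minimal sketch of inspecting and shortening it on a live system, along the lines of what Markus describes having done "via mdb"; the 10-second value is only an example, and poking live kernel variables is at your own risk:

  # print the current per-command timeout, in seconds (decimal)
  echo "sd_io_time/D" | mdb -k

  # lower it to 10 seconds on the running kernel (0t prefix = decimal)
  echo "sd_io_time/W 0t10" | mdb -kw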
Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Markus,

After Jim's answer I have started to read about this well-known issue.

> Is it just mpt causing the errors or also mpt_sas?

Both drivers are causing the reset storm (see my answer to Jim's e-mail).

> General consensus from various people: don't use SATA drives on SAS
> backplanes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.

In Paul Kraus's answer he mentions that Oracle support says (among other things):

>> 4. the problem happens with SAS as well as SATA drives, but is much
>> less frequent

That means that using SAS drives will reduce the probability of the issue, but no guarantee exists.

> General consensus from various people: don't use SATA drives on SAS
> backplanes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.

Maybe the 'general consensus' is right, but the 'general consensus' also told me to use hardware-based RAID solutions. Instead I started doing 'risky business' (as some vendors called it) using ZFS, and I have ended up discovering how robust ZFS is against this kind of protocol error.

From my completely naive point of view it looks more like an issue with the HBA's firmware than an issue with the SATA drives.

Your answers have prompted a lot of research and helped me learn new things. More comments and help are welcome (from some SAS expert?).

Antonio

--
Antonio S. Cofiño

On 31/05/2012 18:04, Weber, Markus wrote:
> Antonio S. Cofiño wrote:
>> [...]
>> The system is a Supermicro motherboard X8DTH-6F in a 4U chassis
>> (SC847E1-R1400LPB) plus an external SAS2 JBOD (SC847E16-RJBOD1). That
>> makes a system with a total of 4 backplanes (2x SAS + 2x SAS2), each of
>> them connected to a different HBA (2x LSI 3081E-R (1068 chip) + 2x LSI
>> SAS9200-8e (2008 chip)). The system has a total of 81 disks (2x SAS
>> (Seagate ST3146356SS) + 34x SATA3 (Hitachi HDS722020ALA330) + 45x SATA6
>> (Hitachi HDS723020BLA642)).
>>
>> The issue arises when one of the disks starts to fail, taking a very long
>> time to complete accesses. After some time (minutes, but I'm not sure) all
>> the disks connected to the same HBA start to report errors. This situation
>> produces a general failure in ZFS, making the whole pool unavailable.
>> [...]
>
> Have been there and gave up in the end[1]. Could reproduce it (even though
> it took a bit longer) under most Linux versions (incl. using the latest LSI
> drivers) and the LSI 3081E-R HBA.
>
> Is it just mpt causing the errors or also mpt_sas?
>
> In a lab environment the LSI 9200 HBA behaved better - I/O only dropped
> briefly and then continued on the other disks without generating errors.
>
> Had a lengthy Oracle case on this, but none of the proposed "workarounds"
> worked for me at all. They were (some also from other forums):
>
> - disabling NCQ
> - allow-bus-device-reset=0; in /kernel/drv/sd.conf
> - set zfs:zfs_vdev_max_pending=1
> - set mpt:mpt_enable_msi=0
> - keeping usage below 90%
> - no FM services running; also temporarily did fmadm unload disk-transport
>   and stopped other disk-access stuff (smartd?)
> - tried changing retries/timeouts via sd.conf for the disks, without any
>   success, and ended up doing it via mdb
>
> In the end I knew the bad sector of the "bad" disk, and by simply dd'ing
> that sector once or twice to /dev/zero I could easily bring down the
> system/pool without any load on the disk system.
>
> General consensus from various people: don't use SATA drives on SAS
> backplanes. Some SATA drives might work better, but there seems to be no
> guarantee. And even for SAS-SAS, try to avoid SAS1 backplanes.
>
> Markus
>
> [1] Search for "What's wrong with LSI 3081 (1068) + expander + (bad) SATA
> disk?"
Re: [zfs-discuss] Disk failure chokes all the disks attached to the failing disk HBA
Jim,

Thank you for the explanation. I have 'discovered' that this is a typical situation that makes the system unstable.

Just out of curiosity: this morning it happened again. Below you can check the log output. This time it was an HBA with an LSI 1068E chip (mpt driver); the previous one was an LSI 2008 (mpt_sas driver). In this case ZFS 'discovered' the error and was able to self-heal, and the system is working smoothly.

Antonio

May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:11 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:11 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:13 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:13 seal.macc.unican.es Log info 0x31123000 received for target 12.
May 31 10:48:13 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:16 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:16 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:17 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:17 seal.macc.unican.es Log info 0x3000 received for target 12.
May 31 10:48:17 seal.macc.unican.es scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
May 31 10:48:20 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:20 seal.macc.unican.es SAS Discovery Error on port 0. DiscoveryStatus is |Unaddressable device found|
May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:22 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:22 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:22 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31123000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x3000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:27 seal.macc.unican.es scsi: [ID 243001 kern.warning] WARNING: /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:27 seal.macc.unican.es mpt_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31112000
May 31 10:48:28 seal.macc.unican.es scsi: [ID 365881 kern.info] /pci@7a,0/pci8086,3410@9/pci1000,3140@0 (mpt2):
May 31 10:48:28 se
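For anyone chasing a similar mpt/mpt_sas reset storm, the stock Solaris/illumos tooling is usually enough to identify which target is misbehaving before the whole bus goes away. A minimal sketch using only standard commands, with no site-specific assumptions:

  # per-device soft/hard/transport error counters, plus vendor and serial
  iostat -En

  # raw FMA ereports (SCSI transport errors with timestamps and device paths)
  fmdump -e
  fmdump -eV

  # anything FMA has already diagnosed as faulted
  fmadm faulty

  # which pool/vdev is accumulating read/write/checksum errors
  zpool status -x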